Abstract

Sequential decision theory formally solves the problem of rational agents in
uncertain worlds if the true environmental prior probability distribution is
known. Solomonoff's theory of universal induction formally solves the problem
of sequence prediction for unknown prior distribution. We combine both
ideas and get a parameter-free theory of universal Artificial Intelligence. We
give strong arguments that the resulting AIXI model is the most intelligent
unbiased agent possible. We outline how the AIXI model can formally solve
a number of problem classes, including sequence prediction, strategic games,
function minimization, reinforcement and supervised learning. The major
drawback of the AIXI model is that it is uncomputable. To overcome this
problem, we construct a modified algorithm AIXItl that is still effectively
more intelligent than any other time t and length l bounded agent. The
computation time of AIXItl is of the order $t \cdot 2^l$. The discussion includes formal
definitions of intelligence order relations, the horizon problem and relations
of the AIXI theory to other AI approaches.

1 Introduction

This article gives an introduction to a mathematical theory for intelligence. We 
present the AIXI model, a parameter-free optimal reinforcement learning agent
embedded in an arbitrary unknown environment.

The science of Artificial Intelligence (AI) may be defined as the construction of 
intelligent systems and their analysis. A natural definition of a system is anything 
that has an input and an output stream. Intelligence is more complicated. It can 
have many faces like creativity, solving problems, pattern recognition, classification, 
learning, induction, deduction, building analogies, optimization, surviving in an 
environment, language processing, knowledge and many more. A formal definition 
incorporating every aspect of intelligence, however, seems difficult. Most, if not all 
known facets of intelligence can be formulated as goal-driven or, more precisely, as 
maximizing some utility function. It is, therefore, sufficient to study goal-driven AI; 
e.g. the (biological) goal of animals and humans is to survive and spread. The goal 
of AI systems should be to be useful to humans. The problem is that, except for 
special cases, we know neither the utility function nor the environment in which the 
agent will operate in advance. The mathematical theory, coined AIXI, is supposed 
to solve these problems. 

Assume the availability of unlimited computational resources. The first important
observation is that this does not make the AI problem trivial. Playing chess
optimally or solving NP-complete problems become trivial, but driving a car or
surviving in nature don't. This is because it is a challenge itself to well-define the
latter problems, not to mention presenting an algorithm. In other words: The AI
problem has not yet been well defined. One may view AIXI as a suggestion for such
a mathematical definition of AI.

AIXI is a universal theory of sequential decision making akin to Solomonoff's
celebrated universal theory of induction. Solomonoff derived an optimal way of
predicting future data, given previous perceptions, provided the data is sampled from
a computable probability distribution. AIXI extends this approach to an optimal
decision making agent embedded in an unknown environment. The main idea is
to replace the unknown environmental distribution $\mu$ in the Bellman equations by
a suitably generalized universal Solomonoff distribution $\xi$. The state space is the
space of complete histories. AIXI is a universal theory without adjustable parameters,
making no assumptions about the environment except that it is sampled from
a computable distribution. From an algorithmic complexity perspective, the AIXI 
model generalizes optimal passive universal induction to the case of active agents. 
From a decision-theoretic perspective, AIXI is a suggestion of a new (implicit)
"learning" algorithm, which may overcome all (except computational) problems of previous
reinforcement learning algorithms.

There are strong arguments that AIXI is the most intelligent unbiased agent 
possible. We outline for a number of problem classes, including sequence prediction, 
strategic games, function minimization, reinforcement and supervised learning, how 
the AIXI model can formally solve them. The major drawback of the AIXI model 
is that it is incomputable. To overcome this problem, we construct a modified
algorithm AIXItl that is still effectively more intelligent than any other time t and
length l bounded agent. The computation time of AIXItl is of the order $t \cdot 2^l$. Other
discussed topics are a formal definition of an intelligence order relation, the horizon
problem and relations of the AIXI theory to other AI approaches.

The article is meant to be a gentle introduction to and discussion of the AIXI 
model. For a mathematically rigorous treatment, many subtleties, and proofs see 
the references to the author's works in the annotated bibliography section at the 
end of this article, and in particular the book [Hut04]. This section also provides
references to introductory textbooks and original publications on algorithmic
information theory and sequential decision theory.

Chapter 2 presents the theory of sequential decisions in a very general form
(called the AI$\mu$ model) in which actions and perceptions may depend on arbitrary past
events. We clarify the connection to the Bellman equations and discuss minor
parameters including (the size of) the I/O spaces and the lifetime of the agent and
their universal choice which we have in mind. Optimality of AI$\mu$ is obvious by
construction.

Chapter 3: How and in which sense induction is possible at all has been subject
to long philosophical controversies. Highlights are Epicurus' principle of multiple
explanations, Occam's razor, and probability theory. Solomonoff elegantly unified
all these aspects into one formal theory of inductive inference based on a universal
probability distribution $\xi$, which is closely related to Kolmogorov complexity
$K(x)$, the length of the shortest program computing $x$. Rapid convergence of $\xi$ to
the unknown true environmental distribution $\mu$ and tight loss bounds for arbitrary
bounded loss functions and finite alphabet can be shown. Pareto optimality of $\xi$
in the sense that there is no other predictor that performs better or equal in all
environments and strictly better in at least one can also be shown. In view of these
results it is fair to say that the problem of sequence prediction possesses a universally
optimal solution.

Chapter 4: In the active case, reinforcement learning algorithms are usually used
if $\mu$ is unknown. They can succeed if the state space is either small or has
effectively been made small by generalization techniques. The algorithms work only in
restricted (e.g. Markovian) domains, have problems with optimally trading off
exploration versus exploitation, have nonoptimal learning rate, are prone to diverge, or are
otherwise ad hoc. The formal solution proposed here is to generalize Solomonoff's
universal prior $\xi$ to include action conditions and replace $\mu$ by $\xi$ in the AI$\mu$ model,
resulting in the AI$\xi$ = AIXI model, which we claim to be universally optimal. We
investigate what we can expect from a universally optimal agent and clarify the
meanings of universal, optimal, etc. Other discussed topics are formal definitions of
an intelligence order relation, the horizon problem, and Pareto optimality of AIXI.

Chapter 5: We show how a number of AI problem classes fit into the general AIXI
model. They include sequence prediction, strategic games, function minimization,
and supervised learning. We first formulate each problem class in its natural way
(for known $\mu$) and then construct a formulation within the AI$\mu$ model and show
their equivalence. We then consider the consequences of replacing $\mu$ by $\xi$. The main
goal is to understand in which sense the problems are solved by AIXI.

Chapter 6: The major drawback of AIXI is that it is incomputable, or more
precisely, only asymptotically computable, which makes an implementation
impossible. To overcome this problem, we construct a modified model AIXItl, which is
still superior to any other time t and length l bounded algorithm. The computation
time of AIXItl is of the order $t \cdot 2^l$. The solution requires an implementation of
first-order logic, the definition of a universal Turing machine within it and a proof
theory system.

Chapter 7: Finally we remark on some otherwise unmentioned topics of general
interest, including concurrent actions
and perceptions, the choice of the I/O spaces, treatment of encrypted information,
and peculiarities of mortal embodied agents. We continue with an outlook
on further research, including optimality, down-scaling, implementation,
approximation, elegance, extra knowledge, and training of/for AIXI(tl). We also include
some (personal) remarks on non-computable physics, the number of wisdom $\Omega$, and
consciousness.

An annotated bibliography and other references conclude this work. 

2 Agents in Known Probabilistic Environments 

The general framework for AI might be viewed as the design and study of intelligent
agents [RN03]. An agent is a cybernetic system with some internal state, which acts
with output $y_k$ on some environment in cycle $k$, perceives some input $x_k$ from the
environment and updates its internal state. Then the next cycle follows. We split the
input $x_k$ into a regular part $o_k$ and a reward $r_k$, often called reinforcement feedback.
From time to time the environment provides nonzero reward to the agent. The task
of the agent is to maximize its utility, defined as the sum of future rewards. A
probabilistic environment can be described by the conditional probability $\mu$ for the inputs
$x_1...x_n$ to the agent under the condition that the agent outputs $y_1...y_n$. Most, if not
all environments are of this type. We give formal expressions for the outputs of the
agent, which maximize the total $\mu$-expected reward sum, called value. This model
is called the AI$\mu$ model. As every AI problem can be brought into this form, the
problem of maximizing utility is hence being formally solved, if $\mu$ is known.
Furthermore, we study some special aspects of the AI$\mu$ model. We introduce factorizable
probability distributions describing environments with independent episodes. They
occur in several problem classes studied in Section 5 and are a special case of more
general separable probability distributions defined in Section 4.3. We also clarify
the connection to the Bellman equations of sequential decision theory and discuss
similarities and differences. We discuss minor parameters of our model, including
(the size of) the input and output spaces $\mathcal{X}$ and $\mathcal{Y}$ and the lifetime of the agent, and
their universal choice, which we have in mind. There is nothing remarkable in this
section; it is the essence of sequential decision theory [NM44, Bel57, BT96, SB98],
presented in a new form. Notation and formulas needed in later sections are simply
developed. There are two major remaining problems: the problem of the unknown
true probability distribution $\mu$, which is solved in Section 4, and computational
aspects, which are addressed in Section 6.

2.1 The Cybernetic Agent Model 

A good way to start thinking about intelligent systems is to consider more generally 
cybernetic systems, in AI usually called agents. This avoids having to struggle 
with the meaning of intelligence from the very beginning. A cybernetic system is a 
control circuit with input $x$ and output $y$ and an internal state. From an external
input and the internal state the agent calculates deterministically or stochastically 
an output. This output (action) modifies the environment and leads to a new input 
(perception). This continues ad infinitum or for a finite number of cycles. 

Definition 1 (The Agent Model) An agent is a system that interacts with an
environment in cycles $k = 1,2,3,...$. In cycle $k$ the action (output) $y_k \in \mathcal{Y}$ of the
agent is determined by a policy $p$ that depends on the I/O-history $y_1x_1...y_{k-1}x_{k-1}$.
The environment reacts to this action and leads to a new perception (input) $x_k \in \mathcal{X}$
determined by a deterministic function $q$ or probability distribution $\mu$, which depends
on the history $y_1x_1...y_{k-1}x_{k-1}y_k$. Then the next cycle $k+1$ starts.
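The cycle structure of Definition 1 can be sketched as a short loop. This is only an illustration under stated assumptions: the function names `interact`, `echo_policy` and `coin_environment` are hypothetical, perceptions are encoded as pairs (reward, observation), and the toy environment is invented for the example.

```python
import random

def interact(policy, environment, cycles, seed=0):
    """Run the agent/environment loop of Definition 1 for a fixed number of cycles.

    policy(history) -> action y_k, where history is [(y_1, x_1), ..., (y_{k-1}, x_{k-1})].
    environment(history, action, rng) -> perception x_k = (reward, observation).
    """
    rng = random.Random(seed)
    history = []
    for _ in range(cycles):
        y = policy(history)                 # y_k depends on y_1 x_1 ... y_{k-1} x_{k-1}
        x = environment(history, y, rng)    # x_k depends on the history and on y_k
        history.append((y, x))
    return history

# Hypothetical toy pair: the environment rewards repeating its last observation.
def echo_policy(history):
    return history[-1][1][1] if history else 0

def coin_environment(history, action, rng):
    target = history[-1][1][1] if history else 0
    reward = 1 if action == target else 0
    return (reward, rng.randint(0, 1))

trace = interact(echo_policy, coin_environment, cycles=5)
total_reward = sum(x[0] for _, x in trace)
```

Because `echo_policy` always repeats the previous observation, it collects reward 1 in every cycle of this toy environment, whatever the random observations turn out to be.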

As explained in the last section, we need some reward assignment to the cybernetic
system. The input $x$ is divided into two parts, the standard input $o$ and some reward
input $r$. If input and output are represented by strings, a deterministic cybernetic
system can be modeled by a Turing machine $p$, where $p$ is called the policy of the
agent, which determines the (re)action to a perception. If the environment is also
computable it might be modeled by a Turing machine $q$ as well. The interaction of
the agent with the environment can be illustrated as follows:

[Figure: agent $p$ and environment $q$ coupled by two tapes. The upper tape carries
the perceptions $x_k = r_k o_k$, the lower tape carries the actions $y_k$; each machine
has its own work tape.]

Both $p$ as well as $q$ have unidirectional input and output tapes and bidirectional
work tapes. What entangles the agent with the environment is the fact that the
upper tape serves as input tape for $p$, as well as output tape for $q$, and that the lower
tape serves as output tape for $p$ as well as input tape for $q$. Further, the reading
head must always be left of the writing head, i.e. the symbols must first be written
before they are read. Both $p$ and $q$ have their own mutually inaccessible work tapes
containing their own "secrets". The heads move in the following way. In the $k$th
cycle $p$ writes $y_k$, $q$ reads $y_k$, $q$ writes $x_k = r_k o_k$, $p$ reads $x_k = r_k o_k$, followed by the
$(k+1)$th cycle and so on. The whole process starts with the first cycle, all heads on
tape start and work tapes being empty. We call Turing machines behaving in this
way chronological Turing machines. Before continuing, some notations on strings
are appropriate.



2.2 Strings 

We denote strings over the alphabet $\mathcal{X}$ by $s = x_1x_2...x_n$, with $x_k \in \mathcal{X}$, where $\mathcal{X}$ is
alternatively interpreted as a nonempty subset of $\mathbb{N}$ or itself as a prefix-free set of
binary strings. The length of $s$ is $\ell(s) = \ell(x_1)+...+\ell(x_n)$. Analogous definitions hold
for $y_k \in \mathcal{Y}$. We call $x_k$ the $k$th input word and $y_k$ the $k$th output word (rather than
letter). The string $s = y_1x_1...y_nx_n$ represents the input/output in chronological order.
Due to the prefix property of the $x_k$ and $y_k$, $s$ can be uniquely separated into its
words. The words appearing in strings are always in chronological order. We further
introduce the following abbreviations: $\epsilon$ is the empty string, $x_{n:m} := x_nx_{n+1}...x_{m-1}x_m$
for $n \le m$ and $\epsilon$ for $n > m$. $x_{<n} := x_1...x_{n-1}$. Analogously for $y$. Further, $yx_n := y_nx_n$,
$yx_{n:m} := y_nx_n...y_mx_m$, and so on.
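The unique separability that the prefix property provides can be illustrated with a small sketch. The codebook `X` below is a hypothetical example of a prefix-free set of binary words, and `split_words` and `is_prefix_free` are illustrative helper names, not notation from the text.

```python
def split_words(s, codebook):
    """Split a binary string s into words from a prefix-free codebook.

    Reading left to right, the first codeword that matches is the only one that
    can match, because no codeword is a prefix of another.
    """
    words, current = [], ""
    for bit in s:
        current += bit
        if current in codebook:
            words.append(current)
            current = ""
    if current:
        raise ValueError("string ends inside a word")
    return words

def is_prefix_free(codebook):
    """True if no word of the codebook is a proper prefix of another word."""
    return not any(a != b and b.startswith(a) for a in codebook for b in codebook)

X = {"0", "10", "110", "111"}          # hypothetical prefix-free alphabet
words = split_words("10011100", X)     # ['10', '0', '111', '0', '0']
length = sum(len(w) for w in words)    # l(s) = l(x_1) + ... + l(x_n) = 8
```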
2.3 AI Model for Known Deterministic Environment

Let us define for the chronological Turing machine $p$ a partial function also named
$p : \mathcal{X}^* \to \mathcal{Y}^*$ with $y_{1:k} = p(x_{<k})$, where $y_{1:k}$ is the output of Turing machine $p$ on
input $x_{<k}$ in cycle $k$, i.e. where $p$ has read up to $x_{k-1}$ but no further.^1 In an
analogous way, we define $q : \mathcal{Y}^* \to \mathcal{X}^*$ with $x_{1:k} = q(y_{1:k})$. Conversely, for every partial
recursive chronological function we can define a corresponding chronological Turing
machine. Each (agent,environment) pair $(p,q)$ produces a unique I/O sequence
$\omega^{pq} := y_1^{pq}x_1^{pq}y_2^{pq}x_2^{pq}...$. When we look at the definitions of $p$ and $q$ we see a nice
symmetry between the cybernetic system and the environment. Until now, not much
intelligence is in our agent. Now the credit assignment comes into the game and
removes the symmetry somewhat. We split the input $x_k \in \mathcal{X} := \mathcal{R} \times \mathcal{O}$ into a regular
part $o_k \in \mathcal{O}$ and a reward $r_k \in \mathcal{R} \subset \mathbb{R}$. We define $x_k \equiv r_k o_k$ and $r_k \equiv r(x_k)$. The goal
of the agent should be to maximize received rewards. This is called reinforcement
learning. The reason for the asymmetry is that eventually we (humans) will be the
environment with which the agent will communicate and we want to dictate what
is good and what is wrong, not the other way round. This one-way learning, the
agent learns from the environment, and not conversely, neither prevents the agent
from becoming more intelligent than the environment, nor does it prevent the
environment learning from the agent because the environment can itself interpret the
outputs $y_k$ as a regular and a reward part. The environment is just not forced to
learn, whereas the agent is. In cases where we restrict the reward to two values
$r \in \mathcal{R} = \mathbb{B} := \{0,1\}$, $r = 1$ is interpreted as a positive feedback, called good or correct,
and $r = 0$ a negative feedback, called bad or error. Further, let us restrict for a while
the lifetime (number of cycles) $m$ of the agent to a large but finite value. Let

$$V_{km}^{pq} := \sum_{i=k}^{m} r(x_i^{pq})$$

be the future total reward (called future utility), the agent $p$ receives from the
environment $q$ in the cycles $k$ to $m$. It is now natural to call the agent $p^*$ that
maximizes $V_{1m}$ (called total utility), the best one.^2

$$p^* := \arg\max_p V_{1m}^{pq} \quad\Rightarrow\quad V_{km}^{p^*q} \ge V_{km}^{pq} \quad \forall p : y_{<k}^{pq} = y_{<k}^{p^*q} \tag{1}$$

For $k = 1$ the condition on $p$ is nil. For $k > 1$ it states that $p$ shall be consistent
with $p^*$ in the sense that they have the same history. If $\mathcal{X}$, $\mathcal{Y}$ and $m$ are finite, the
number of different behaviors of the agent, i.e. the search space, is finite. Therefore,
because we have assumed that $q$ is known, $p^*$ can effectively be determined by
pre-analyzing all behaviors. The main reason for restricting to finite $m$ was not to
ensure computability of $p^*$ but that the limit $m \to \infty$ might not exist. The ease
with which we defined and computed the optimal policy $p^*$ is not remarkable. Just
the (unrealistic) assumption of a completely known deterministic environment $q$ has
trivialized everything.

^1 Note that a possible additional dependence of $p$ on $y_{<k}$ as mentioned in Definition 1 can be
eliminated by recursive substitution; see below. Similarly for $q$.

^2 $\arg\max_p V(p)$ is the $p$ that maximizes $V(\cdot)$. If there is more than one maximum we might
choose the lexicographically smallest one for definiteness.
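For a known deterministic environment and finite lifetime, pre-analyzing all behaviors can be sketched as an exhaustive search. The sketch below makes two assumptions that are not part of the text: it enumerates action sequences $y_{1:m}$ rather than policies (equivalent here, since a known deterministic $q$ fixes every perception once the actions are fixed), and it uses a hypothetical interface `q(y_prefix)` that returns the perception for the last action as a pair (reward, observation).

```python
from itertools import product

def best_action_sequence(q, actions, m):
    """Enumerate all y_{1:m} and return (total reward, y_{1:m}) maximizing V_{1m}.

    q(y_prefix) -> perception x_k = (reward, observation) for the last action,
    a hypothetical interface for a known chronological deterministic environment.
    """
    best = None
    for ys in product(sorted(actions), repeat=m):      # lexicographic order
        total = sum(q(ys[:k])[0] for k in range(1, m + 1))
        if best is None or total > best[0]:            # strict >: keep the first maximizer
            best = (total, ys)
    return best

# Toy environment: reward 1 whenever the action differs from the previous one.
def q(y_prefix):
    reward = 1 if len(y_prefix) >= 2 and y_prefix[-1] != y_prefix[-2] else 0
    return (reward, y_prefix[-1])

value, plan = best_action_sequence(q, actions={0, 1}, m=4)   # (3, (0, 1, 0, 1))
```

The search visits $|\mathcal{Y}|^m$ sequences, so it is only feasible for tiny examples. Ties are broken in favor of the lexicographically smallest sequence, as suggested in footnote 2.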



2.4 AI Model for Known Prior Probability 

Let us now weaken our assumptions by replacing the deterministic environment $q$
with a probability distribution $\mu(q)$ over chronological functions. Here $\mu$ might be
interpreted in two ways. Either the environment itself behaves stochastically defined
by $\mu$, or the true environment is deterministic, but we only have subjective
(probabilistic) information of which environment is the true environment. Combinations
of both cases are also possible. We assume here that $\mu$ is known and describes
the true stochastic behavior of the environment. The case of unknown $\mu$ with the
agent having some beliefs about the environment lies at the heart of the AI$\xi$ model
described in Section 4.

The best or most intelligent agent is now the one that maximizes the expected
utility (called value function) $V_\mu^p \equiv V_{1m}^{p\mu} := \sum_q \mu(q) V_{1m}^{pq}$. This defines the AI$\mu$ model.

Definition 2 (The AI$\mu$ model) The AI$\mu$ model is the agent with policy $p^\mu$ that
maximizes the $\mu$-expected total reward $r_1 + ... + r_m$, i.e. $p^* \equiv p^\mu := \arg\max_p V_\mu^p$. Its
value is $V_\mu^* := V_\mu^{p^\mu}$.

We need the concept of a value function in a slightly more general form.

Definition 3 (The $\mu$/true/generating value function) The agent's perception
$x$ consists of a regular observation $o \in \mathcal{O}$ and a reward $r \in \mathcal{R} \subset \mathbb{R}$. In cycle $k$ the
value $V_{km}^{p\mu}(yx_{<k})$ is defined as the $\mu$-expectation of the future reward sum $r_k + ... + r_m$,
with actions generated by policy $p$, and fixed history $yx_{<k}$. We say that $V_{km}^{p\mu}(yx_{<k})$ is
the (future) value of policy $p$ in environment $\mu$ given history $yx_{<k}$, or shorter, the
$\mu$ or true or generating value of $p$ given $yx_{<k}$. $V_\mu^p := V_{1m}^{p\mu}$ is the (total) value of $p$.

We now give a more formal definition for $V_{km}^{p\mu}$. Let us assume we are in cycle $k$ with
history $\dot{y}\dot{x}_1...\dot{y}\dot{x}_{k-1}$ and ask for the best output $y_k$. Further, let
$\dot{Q}_k := \{q : q(\dot{y}_{<k}) = \dot{x}_{<k}\}$
be the set of all environments producing the above history. We say that $q \in \dot{Q}_k$ is
consistent with history $\dot{y}\dot{x}_{<k}$. The expected reward for the next $m-k+1$ cycles
(given the above history) is called the value of policy $p$ and is given by a conditional
probability:

$$V_{km}^{p\mu}(\dot{y}\dot{x}_{<k}) := \frac{\sum_{q \in \dot{Q}_k} \mu(q)\, V_{km}^{pq}}{\sum_{q \in \dot{Q}_k} \mu(q)} \tag{2}$$

Policy $p$ and environment $\mu$ do not determine history $\dot{y}\dot{x}_{<k}$, unlike the deterministic
case, because the history is no longer deterministically determined by $p$ and $q$, but
depends on $p$ and $\mu$ and on the outcome of a stochastic process. Every new cycle
adds new information ($\dot{x}_i$) to the agent. This is indicated by the dots over the
symbols. In cycle $k$ we have to maximize the expected future rewards, taking into
account the information in the history $\dot{y}\dot{x}_{<k}$. This information is not already present
in $p$ and $q$/$\mu$ at the agent's start, unlike in the deterministic case.

Furthermore, we want to generalize the finite lifetime $m$ to a dynamic
(computable) farsightedness $h_k \equiv m_k - k + 1 \ge 1$, called horizon. For $m_k = m$ we have our
original finite lifetime; for $h_k = h$ the agent maximizes in every cycle the next $h$
expected rewards. A discussion of the choices for $m_k$ is delayed to Section 4.5. The
next $h_k$ rewards are maximized by

$$p_k^* := \arg\max_{p \in \dot{P}_k} V_{km_k}^{p\mu}(\dot{y}\dot{x}_{<k}),$$

where $\dot{P}_k := \{p : \exists y_k : p(\dot{x}_{<k}) = \dot{y}_{<k}y_k\}$ is the set of systems consistent with the current
history. Note that $p_k^*$ depends on $k$ and is used only in step $k$ to determine $\dot{y}_k$
by $p_k^*(\dot{x}_{<k}|\dot{y}_{<k}) = \dot{y}_{<k}\dot{y}_k$. After writing $\dot{y}_k$ the environment replies with $\dot{x}_k$ with
(conditional) probability $\mu(\dot{Q}_{k+1})/\mu(\dot{Q}_k)$. This probabilistic outcome provides new
information to the agent. The cycle $k+1$ starts with determining $\dot{y}_{k+1}$ from $p_{k+1}^*$
(which can differ from $p_k^*$ for dynamic $m_k$) and so on. Note that $p_k^*$ implicitly also
depends on $\dot{y}_{<k}$ because $\dot{P}_k$ and $\dot{Q}_k$ do so. But recursively inserting $p_{k-1}^*$ and so on,
we can define

$$p^*(\dot{x}_{<k}) := p_k^*(\dot{x}_{<k} \,|\, p_{k-1}^*(\dot{x}_{<k-1} \,|\, ... \, p_1^*)) \tag{3}$$

It is a chronological function and computable if $\mathcal{X}$, $\mathcal{Y}$ and $m_k$ are finite and $\mu$ is
computable. For constant $m$ one can show that the policy (3) coincides with the
AI$\mu$ model (Definition 2). This also proves

$$V_{km}^{*\mu}(yx_{<k}) \ge V_{km}^{p\mu}(yx_{<k}) \quad \forall p \text{ consistent with } yx_{<k} \tag{4}$$

similarly to (1). For $k = 1$ this is obvious. We also call (3) AI$\mu$ model. For
deterministic^3 $\mu$ this model reduces to the deterministic case discussed in the last
subsection.

It is important to maximize the sum of future rewards and not, for instance, to be 
greedy and only maximize the next reward, as is done e.g. in sequence prediction. 
For example, let the environment be a sequence of chess games, and each cycle 
corresponds to one move. Only at the end of each game is a positive reward r = 1 
given to the agent if it won the game (and made no illegal move). For the agent, 
maximizing all future rewards means trying to win as many games as possible in as
short a time as possible (and avoiding illegal moves). The same performance is reached if we
choose $h_k$ much larger than the typical game lengths. An agent maximizing only the next
reward would play chess very badly. Even if we made our reward
$r$ finer, e.g. by evaluating the number of chessmen, the agent would still play very bad
chess for $h_k = 1$.
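The effect of the horizon can be illustrated with a deliberately tiny, hypothetical stand-in for the chess example. Everything in the sketch is an assumption made for illustration: a "game" lasts three moves, the move `"grab"` pays 1 at once but forfeits the game, and three `"develop"` moves in a row pay 5 on the last move. The agent replans in every cycle and maximizes only the next `horizon` rewards.

```python
from itertools import product

GAME_LENGTH = 3

def reward(actions_so_far):
    """Reward for the last action in a hypothetical 3-move game.

    'grab' pays 1 at once but forfeits the game; three 'develop' moves in a row
    pay 5 on the final move.
    """
    start = (len(actions_so_far) - 1) // GAME_LENGTH * GAME_LENGTH
    game = actions_so_far[start:]
    if game[-1] == "grab":
        return 1
    return 5 if len(game) == GAME_LENGTH and "grab" not in game else 0

def run(horizon, cycles):
    """Receding-horizon agent: in each cycle maximize the next `horizon` rewards."""
    past, total = (), 0
    for _ in range(cycles):
        def lookahead(plan):
            return sum(reward(past + plan[:i]) for i in range(1, len(plan) + 1))
        best_plan = max(product(("develop", "grab"), repeat=horizon), key=lookahead)
        past += (best_plan[0],)          # execute only the first planned action
        total += reward(past)
    return total

greedy_total = run(horizon=1, cycles=6)      # h_k = 1: grabs every move, total 6
farsighted_total = run(horizon=3, cycles=6)  # h_k = game length: wins both games, total 10
```

In this toy setting a horizon equal to the game length already suffices; the environment is deterministic, so no expectation is needed.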



^3 We call a probability distribution deterministic if it assumes values 0 and 1 only.
The AI$\mu$ model still depends on $\mu$ and $m_k$; $m_k$ is addressed in Section 4.5. To get
our final universal AI model the idea is to replace $\mu$ by the universal probability $\xi$,
defined later. This is motivated by the fact that $\xi$ converges to $\mu$ in a certain sense for
any $\mu$. With $\xi$ instead of $\mu$ our model no longer depends on any parameters, so it is
truly universal. It remains to show that it behaves intelligently. But let us continue
step by step. In the following we develop an alternative but equivalent formulation
of the AI$\mu$ model. Whereas the functional form presented above is more suitable for
theoretical considerations, especially for the development of a time-bounded version
in Section 6, the iterative and recursive formulation of the next subsections will be
more appropriate for the explicit calculations in most of the other sections.

2.5 Probability Distributions 

We use Greek letters for probability distributions, and underline their arguments
to indicate that they are probability arguments. Let $\rho_n(\underline{x}_1...\underline{x}_n)$ be the probability
that an (infinite) string starts with $x_1...x_n$. We drop the index on $\rho$ if it is clear
from its arguments:

$$\sum_{x_n \in \mathcal{X}} \rho(\underline{x}_{1:n}) \equiv \sum_{x_n} \rho_n(\underline{x}_{1:n}) = \rho_{n-1}(\underline{x}_{<n}) \equiv \rho(\underline{x}_{<n}), \qquad \rho(\epsilon) \equiv \rho_0(\epsilon) = 1. \tag{5}$$

We also need conditional probabilities derived from the chain rule. We prefer a
notation that preserves the chronological order of the words, in contrast to the standard
notation $\rho(\cdot|\cdot)$ that flips it. We extend the definition of $\rho$ to the conditional case with
the following convention for its arguments: An underlined argument $\underline{x}_k$ is a
probability variable, and other non-underlined arguments $x_k$ represent conditions. With
this convention, the conditional probability has the form $\rho(x_{<n}\underline{x}_n) = \rho(\underline{x}_{1:n})/\rho(\underline{x}_{<n})$.
The equation states that the probability that a string $x_1...x_{n-1}$ is followed by $x_n$ is
equal to the probability of $x_1...x_n*$ divided by the probability of $x_1...x_{n-1}*$. We use
$x*$ as an abbreviation for 'strings starting with $x$'.

The introduced notation is also suitable for defining the conditional probability
$\rho(y_1\underline{x}_1...y_n\underline{x}_n)$ that the environment reacts with $x_1...x_n$ under the condition that the
output of the agent is $y_1...y_n$. The environment is chronological, i.e. input $x_i$ depends
on $yx_{<i}y_i$ only. In the probabilistic case this means that $\rho(y\underline{x}_{<k}y_k) := \sum_{x_k} \rho(y\underline{x}_{1:k})$
is independent of $y_k$, hence a tailing $y_k$ in the arguments of $\rho$ can be dropped.
Probability distributions with this property will be called chronological. The $y$ are
always conditions, i.e. are never underlined, whereas additional conditioning for the
$x$ can be obtained with the chain rule

$$\rho(yx_{<n}y\underline{x}_n) = \rho(y\underline{x}_{1:n}) \,/\, \rho(y\underline{x}_{<n}) \quad\text{and} \tag{6}$$

$$\rho(y\underline{x}_{1:n}) = \rho(y\underline{x}_1) \cdot \rho(yx_1y\underline{x}_2) \cdot ... \cdot \rho(yx_{<n}y\underline{x}_n).$$

The second equation is the first equation applied n times.
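The chain rule (6) and the chronological property can be checked numerically on a small example. The conditional rule in `cond` is a hypothetical choice made only for illustration; exact fractions are used to avoid rounding.

```python
from itertools import product
from fractions import Fraction

def cond(history, y, x):
    """rho(yx_{<n} y x_n): hypothetical rule, x repeats the action y with probability 3/4."""
    return Fraction(3, 4) if x == y else Fraction(1, 4)

def joint(ys, xs):
    """rho(yx_{1:n}) as the product of conditionals (second line of (6))."""
    p, history = Fraction(1), []
    for y, x in zip(ys, xs):
        p *= cond(tuple(history), y, x)
        history.append((y, x))
    return p

ys = (0, 1, 1)
# For fixed actions the joint probabilities of all input strings sum to 1.
total = sum(joint(ys, xs) for xs in product((0, 1), repeat=3))
# Marginalizing the last input gives the same value for either final action (chronology).
marginals = [sum(joint((0, 1, y3), (0, 1, x3)) for x3 in (0, 1)) for y3 in (0, 1)]
```

Both entries of `marginals` equal $\rho(y\underline{x}_{1:2}) = 9/16$ for the history $y_{1:2} = 01$, $x_{1:2} = 01$, which is why a tailing $y_k$ can be dropped.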
2.6 Explicit Form of the AI$\mu$ Model

Let us define the AI$\mu$ model $p^*$ in a different way: Let $\mu(yx_{<k}y\underline{x}_k)$ be the true
probability of input $x_k$ in cycle $k$, given the history $yx_{<k}y_k$; $\mu(y\underline{x}_{1:k})$ is the true
chronological prior probability that the environment reacts with $x_{1:k}$ if provided
with actions $y_{1:k}$ from the agent. We assume the cybernetic model depicted in
Section 2.1 to be valid. Next we define the value $V_{k+1,m}^{*\mu}(yx_{1:k})$ to be the $\mu$-expected
reward sum $r_{k+1} + ... + r_m$ in cycles $k+1$ to $m$ with outputs $y_i$ generated by agent $p^*$
that maximizes the expected reward sum, and responses $x_i$ from the environment,
drawn according to $\mu$. Adding $r(x_k) \equiv r_k$ we get the reward including cycle $k$. The
probability of $x_k$, given $yx_{<k}y_k$, is given by the conditional probability $\mu(yx_{<k}y\underline{x}_k)$.
So the expected reward sum in cycles $k$ to $m$ given $yx_{<k}y_k$ is

$$V_{km}^{*\mu}(yx_{<k}y_k) := \sum_{x_k} [r(x_k) + V_{k+1,m}^{*\mu}(yx_{1:k})] \cdot \mu(yx_{<k}y\underline{x}_k) \tag{7}$$

Now we ask how $p^*$ chooses $y_k$. It should choose $y_k$ as to maximize the future
rewards. So the expected reward in cycles $k$ to $m$ given $yx_{<k}$ and $y_k$ chosen by $p^*$ is
$V_{km}^{*\mu}(yx_{<k}) := \max_{y_k} V_{km}^{*\mu}(yx_{<k}y_k)$ (see Figure 4).

Figure 4 (Expectimax Tree/Algorithm for $\mathcal{O} = \mathcal{Y} = \mathbb{B}$)

Together with the induction start

$$V_{m+1,m}^{*\mu}(yx_{1:m}) := 0 \tag{8}$$

$V_{km}^{*\mu}$ is completely defined. We might summarize one cycle into the formula

$$V_{km}^{*\mu}(yx_{<k}) = \max_{y_k} \sum_{x_k} [r(x_k) + V_{k+1,m}^{*\mu}(yx_{1:k})] \cdot \mu(yx_{<k}y\underline{x}_k) \tag{9}$$

We introduce a dynamic (computable) farsightedness $h_k \equiv m_k - k + 1 \ge 1$, called
horizon. For $m_k = m$, where $m$ is the lifetime of the agent, we achieve optimal
behavior, for limited farsightedness $h_k = h$ ($m = m_k = h + k - 1$), the agent maximizes
in every cycle the next $h$ expected rewards. A discussion of the choices for $m_k$ is
delayed to Section 4.5. If $m_k$ is our horizon function of $p^*$ and $\dot{y}\dot{x}_{<k}$ is the actual
history in cycle $k$, the output $\dot{y}_k$ of the agent is explicitly given by

$$\dot{y}_k = \arg\max_{y_k} V_{km_k}^{*\mu}(\dot{y}\dot{x}_{<k}y_k) \tag{10}$$

which in turn defines the policy $p^*$. Then the environment responds $\dot{x}_k$ with
probability $\mu(\dot{y}\dot{x}_{<k}\dot{y}\underline{\dot{x}}_k)$. Then cycle $k+1$ starts. We might unfold the recursion (9) further
and give $\dot{y}_k$ nonrecursively as

$$\dot{y}_k \equiv \dot{y}_k^\mu := \arg\max_{y_k} \sum_{x_k} \max_{y_{k+1}} \sum_{x_{k+1}} ... \max_{y_{m_k}} \sum_{x_{m_k}} (r(x_k) + ... + r(x_{m_k})) \cdot \mu(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m_k}) \tag{11}$$

This has a direct interpretation: The probability of inputs $x_{k:m_k}$ in cycle $k$ when the
agent outputs $y_{k:m_k}$ with actual history $\dot{y}\dot{x}_{<k}$ is $\mu(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m_k})$. The future reward in
this case is $r(x_k) + ... + r(x_{m_k})$. The best expected reward is obtained by averaging
over the $x_i$ ($\sum_{x_i}$) and maximizing over the $y_i$. This has to be done in chronological
order to correctly incorporate the dependencies of $x_i$ and $y_i$ on the history. This is
essentially the expectimax algorithm/tree [Mic66, RN03]. The AI$\mu$ model is optimal
in the sense that no other policy leads to higher expected reward. The value for a
general policy $p$ can be written in the form

$$V_{km}^{p\mu}(yx_{<k}) := \sum_{x_{k:m}} (r_k + ... + r_m)\, \mu(yx_{<k}y\underline{x}_{k:m})\Big|_{y_{1:m} = p(x_{<m})} \tag{12}$$
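The recursion (7) to (10) and the value (12) of a fixed policy can be sketched for a finite toy environment. The environment `mu` below is a hypothetical example chosen so that far-sighted and greedy behavior differ; perceptions are pairs (reward, observation), histories are tuples of (action, perception) pairs, and all function names are illustrative. The recursion takes time exponential in the horizon.

```python
from fractions import Fraction

ACTIONS = (0, 1)
PERCEPTS = ((0, 0), (1, 0))      # x = (reward, observation); observation unused here

def mu(history, y, x):
    """mu(yx_{<k} y x_k): hypothetical environment. Action 1 pays reward 1 with
    probability 3/4 if the previous action was also 1, otherwise with probability 1/4;
    action 0 pays reward 1 with probability 1/2."""
    if y == 0:
        p_reward = Fraction(1, 2)
    else:
        p_reward = Fraction(3, 4) if history and history[-1][0] == 1 else Fraction(1, 4)
    return p_reward if x[0] == 1 else 1 - p_reward

def v_star(history, k, m):
    """V*_{km}(yx_{<k}) via recursion (9), with induction start (8)."""
    if k > m:
        return Fraction(0)
    return max(q_value(history, y, k, m) for y in ACTIONS)

def q_value(history, y, k, m):
    """V*_{km}(yx_{<k} y_k), equation (7)."""
    return sum(mu(history, y, x) * (x[0] + v_star(history + ((y, x),), k + 1, m))
               for x in PERCEPTS)

def best_action(history, k, m):
    """Equation (10); ties go to the first action in ACTIONS."""
    return max(ACTIONS, key=lambda y: q_value(history, y, k, m))

def v_policy(policy, history, k, m):
    """V^p_{km}(yx_{<k}) for a fixed policy, equation (12) written recursively."""
    if k > m:
        return Fraction(0)
    y = policy(history)
    return sum(mu(history, y, x) * (x[0] + v_policy(policy, history + ((y, x),), k + 1, m))
               for x in PERCEPTS)

optimal = v_star((), 1, 3)                       # 7/4
always_zero = v_policy(lambda h: 0, (), 1, 3)    # 3/2
greedy_first = best_action((), 1, 1)             # horizon 1 picks action 0
planned_first = best_action((), 1, 3)            # horizon 3 picks action 1
```

With lifetime $m = 3$ the optimal value 7/4 exceeds the value 3/2 of the policy that always plays 0, in line with the optimality statement above.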

As is clear from their interpretations, the iterative environmental probability $\mu$
relates to the functional form in the following way:

$$\mu(y\underline{x}_{1:k}) = \sum_{q : q(y_{1:k}) = x_{1:k}} \mu(q) \tag{13}$$

With this identification one can show [Hut00, Hut04] the following:

Theorem 5 (Equivalence of functional and explicit AI model) The actions
of the functional AI model (3) coincide with the actions of the explicit
(recursive/iterative) AI model (9)-(11) with environments identified by (13).
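The identification (13) can be made concrete with a small hypothetical class of deterministic environments and prior weights. The three environments and their weights below are invented for illustration only; each maps the actions so far to the current perception.

```python
from fractions import Fraction

# Hypothetical class of deterministic chronological environments q: y_{1:k} -> x_k,
# each with prior weight mu(q).
def copy_env(ys):      # echoes the latest action
    return ys[-1]

def flip_env(ys):      # returns the opposite of the latest action
    return 1 - ys[-1]

def zero_env(ys):      # always answers 0
    return 0

PRIOR = {copy_env: Fraction(1, 2), flip_env: Fraction(1, 4), zero_env: Fraction(1, 4)}

def consistent(q, ys, xs):
    """True if q(y_{1:k}) = x_{1:k}, i.e. q reproduces every perception so far."""
    return all(q(ys[:i + 1]) == xs[i] for i in range(len(xs)))

def mu_joint(ys, xs):
    """mu(yx_{1:k}) as the total prior weight of consistent environments, equation (13)."""
    return sum(w for q, w in PRIOR.items() if consistent(q, ys, xs))

def mu_cond(ys, xs):
    """mu(yx_{<k} y x_k) = mu(yx_{1:k}) / mu(yx_{<k}), the chain rule (6)."""
    return mu_joint(ys, xs) / mu_joint(ys[:-1], xs[:-1])

first = mu_joint((0,), (0,))           # copy_env and zero_env both answer 0: 3/4
after = mu_cond((0, 1), (0, 1))        # only copy_env answers 1 to action 1: 2/3
```

The conditional `mu_cond` is the ratio of consistent weights after and before the new perception, which matches the reply probability $\mu(\dot{Q}_{k+1})/\mu(\dot{Q}_k)$ of the functional form.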



2.7 Factorizable Environments 

Up to now we have made no restrictions on the form of the prior probability $\mu$ apart
from being a chronological probability distribution. On the other hand, we will see
that, in order to prove rigorous reward bounds, the prior probability must satisfy
some separability condition to be defined later. Here we introduce a very strong
form of separability, when $\mu$ factorizes into products.

Assume that the cycles are grouped into independent episodes $r = 0,1,2,...$, where
each episode $r$ consists of the cycles $k = n_r+1,...,n_{r+1}$ for some $0 = n_0 < n_1 < ... < n_s = n$:

$$\mu(y\underline{x}_{1:n}) = \prod_{r=0}^{s-1} \mu_r(y\underline{x}_{n_r+1:n_{r+1}}) \tag{14}$$

(In the simplest case, when all episodes have the same length $l$ then $n_r = r \cdot l$). Then
$\dot{y}_k$ depends on $\mu_r$ and $x$ and $y$ of episode $r$ only, with $r$ such that $n_r < k \le n_{r+1}$. One
can show that

$$\dot{y}_k = \arg\max_{y_k} V_{km_k}^{*\mu}(\dot{y}\dot{x}_{<k}y_k) = \arg\max_{y_k} V_{kt}^{*\mu}(\dot{y}\dot{x}_{<k}y_k) \tag{15}$$

with $t := \min\{m_k, n_{r+1}\}$. The different episodes are completely independent in the
sense that the inputs $x_k$ of different episodes are statistically independent and depend
only on the outputs $y_k$ of the same episode. The outputs $y_k$ depend on the $x$ and
$y$ of the corresponding episode $r$ only, and are independent of the actual I/O of the
other episodes.

Note that $\dot{y}_k$ is also independent of the choice of $m_k$, as long as $m_k$ is sufficiently
large. If all episodes have a length of at most $l$, i.e. $n_{r+1} - n_r \le l$, and if we choose
the horizon $h_k$ to be at least $l$, then $m_k \ge k + l - 1 \ge n_r + l \ge n_{r+1}$ and hence $t = n_{r+1}$
independent of $m_k$. This means that for factorizable $\mu$ there is no problem in taking
the limit $m_k \to \infty$. Maybe this limit can also be performed in the more general case
of a sufficiently separable $\mu$. The (problem of the) choice of $m_k$ will be discussed in
more detail later.

Although factorizable $\mu$ are too restrictive to cover all AI problems, they often
occur in practice in the form of repeated problem solving, and hence, are worthy
of study. For example, if the agent has to play games like chess repeatedly, or has
to minimize different functions, the different games/functions might be completely
independent, i.e. the environmental probability factorizes, where each factor
corresponds to a game/function minimization. For details, see the appropriate sections
on strategic games and function minimization.

Further, for factorizable fx it is probably easier to derive suitable reward bounds 
for the universal AI^ model defined in the next section, than for the separable cases 
that will be introduced later. This could be a first step toward a definition and 
proof for the general case of separable problems. One goal of this paragraph was to 
show that the notion of a factorizable fj, could be the first step toward a definition 
and analysis of the general case of separable fj,. 

2.8 Constants and Limits 

We have in mind a universal agent with complex interactions that is at least as
intelligent and complex as a human being. One might think of an agent whose
input $x_k$ comes from a digital video camera, and whose output $y_k$ is some image to a
monitor;* for the rewards we might restrict ourselves to the most primitive binary ones,
i.e. $r_k \in \mathbb{B} = \{0,1\}$. So we think of the following constant sizes:

$$1 \ll \langle\ell(y_k x_k)\rangle \ll k \le m \ll |\mathcal{Y}\times\mathcal{X}|$$
$$1 \ll 2^{16} \ll 2^{24} \le 2^{32} \ll 2^{65536}$$

The first two limits say that the actual number $k$ of inputs/outputs should be
reasonably large compared to the typical length $\langle\ell\rangle$ of the input/output words,
which itself should be rather sizeable. The last limit expresses the fact that the
total lifetime $m$ (number of I/O cycles) of the agent is far too small to allow every
possible input to occur, or to try every possible output, or to make use of identically
repeated inputs or outputs. We do not expect any useful outputs for $k \lesssim \langle\ell\rangle$. More
interesting than the lengths of the inputs is the complexity $K(x_1\ldots x_k)$ of all inputs
until now, to be defined later. The environment is usually not "perfect". The agent
could either interact with an imperfect human or tackle a nondeterministic world
(due to quantum mechanics or chaos).** In either case, the sequence contains some
noise, leading to $K(x_1\ldots x_k) \propto \langle\ell\rangle\cdot k$. The complexity of the probability distribution of
the input sequence is something different. We assume that this noisy world operates
according to some simple computable rules: $K(\mu_k) \ll \langle\ell\rangle\cdot k$, i.e. the rules of the
world can be highly compressed. We may allow environments in which new aspects
appear for $k \to \infty$, causing a non-bounded $K(\mu_k)$.

In the following we never use these limits, except when explicitly stated. In some 
simpler models and examples the size of the constants will even violate these limits 
(e.g. $\ell(x_k) = \ell(y_k) = 1$), but it is the limits above that the reader should bear in mind.
We are only interested in theorems that do not degenerate under the above limits. 
In order to avoid cumbersome convergence and existence considerations we make 
the following assumptions throughout this work: 

Assumption 6 (Finiteness) We assume that 

• the input/perception space $\mathcal{X}$ is finite,

• the output/action space $\mathcal{Y}$ is finite,

• the rewards are nonnegative and bounded, i.e. $r_k \in \mathcal{R} \subseteq [0, r_{max}]$,

• the horizon $m$ is finite.

Finite $\mathcal{X}$ and bounded $\mathcal{R}$ (each separately) ensure existence of $\mu$-expectations but
are sometimes needed together. Finite $\mathcal{Y}$ ensures that $\arg\max_{y_k\in\mathcal{Y}}[\ldots]$ exists, i.e.
that maxima are attained, while finite $m$ avoids various technical and philosophical
problems (Section 4.5), and positive rewards are needed for the time-bounded AIXItl
model (Section 6). Many theorems can be generalized by relaxing some or all of the
above finiteness assumptions.

* Humans can only simulate a screen as output device by drawing pictures.
** Whether there exist truly stochastic processes at all is a difficult question. At least the quantum
indeterminacy comes very close to it.

2.9 Sequential Decision Theory

One can relate (11) to the Bellman equations [Bel57] of sequential decision theory by
identifying complete histories $yx_{<k}$ with states, $\mu(yx_{<k}y\underline{x}_k)$ with the state transition
matrix, $V^*_\mu$ with the value function, and $y_k$ with the action in cycle $k$ [BT96, RN03].
Due to the use of complete histories as state space, the AIμ model neither assumes
stationarity, nor the Markov property, nor complete accessibility of the environment.
Every state occurs at most once in the lifetime of the system. For this and other
reasons the explicit formulation (11) is more natural and useful here than enforcing
a pseudo-recursive Bellman equation form.

As we have in mind a universal system with complex interactions, the action
and perception spaces $\mathcal{Y}$ and $\mathcal{X}$ are huge (e.g. video images), and every action or
perception itself usually occurs only once in the lifespan $m$ of the agent. As there is
no (obvious) universal similarity relation on the state space, an effective reduction
of its size is impossible, but there is no problem in principle in determining $y_k$ from
(11) as long as $\mu$ is known and computable, and $\mathcal{X}$, $\mathcal{Y}$ and $m$ are finite.

Things drastically change if $\mu$ is unknown. Reinforcement learning algorithms
[KLM96, SB98, BT96] are commonly used in this case to learn the unknown $\mu$ or
directly its value. They succeed if the state space is either small or has effectively
been made small by generalization or function approximation techniques. In any
case, the solutions are either ad hoc, work in restricted domains only, have serious
problems with state space exploration versus exploitation, are prone to diverge,
or have nonoptimal learning rates. There is no universal and optimal solution to
this problem so far. The central theme of this article is to present a new model
and argue that it formally solves all these problems in an optimal way. The true
probability distribution $\mu$ will not be learned directly, but will be replaced by some
generalized universal prior $\xi$, which converges to $\mu$.

3 Universal Sequence Prediction 

This section deals with the question of how to make predictions in unknown environ- 
ments. Following a brief description of important philosophical attitudes regarding 
inductive reasoning and inference, we describe more accurately what we mean by 
induction, and motivate why we can focus on sequence prediction tasks. The most 
important concept is Occam's razor (simplicity) principle. Indeed, one can show 
that the best way to make predictions is based on the shortest (= simplest) descrip- 
tion of the data sequence seen so far. The most general effective descriptions can be 
obtained with the help of general recursive functions, or equivalently by using pro- 
grams on Turing machines, especially on the universal Turing machine. The length 
of the shortest program describing the data is called the Kolmogorov complexity 
of the data. Probability theory is needed to deal with uncertainty. The environment
may be a stochastic process (e.g. gambling houses or quantum physics) that
can be described by "objective" probabilities. But also uncertain knowledge about
the environment, which leads to beliefs about it, can be modeled by "subjective"
probabilities. The old question left open by subjectivists of how to choose the a
priori probabilities is solved by Solomonoff's universal prior, which is closely related
to Kolmogorov complexity. Solomonoff's major result is that the universal (subjective)
posterior converges to the true (objective) environment(al probability) $\mu$. The
only assumption on $\mu$ is that $\mu$ (which need not be known!) is computable. The
problem of the unknown environment $\mu$ is hence solved for all problems of inductive
type, like sequence prediction and classification.

3.1 Introduction 

An important and highly nontrivial aspect of intelligence is inductive inference. 
Simply speaking, induction is the process of predicting the future from the past, or 
more precisely, it is the process of finding rules in (past) data and using these rules to 
guess future data. Weather or stock-market forecasting, or continuing number series 
in an IQ test are nontrivial examples. Making good predictions plays a central role in 
natural and artificial intelligence in general, and in machine learning in particular. 
All induction problems can be phrased as sequence prediction tasks. This is, for 
instance, obvious for time-series prediction, but also includes classification tasks. 
Having observed data $x_t$ at times $t < n$, the task is to predict the $n$-th symbol $x_n$ from
the sequence $x_1\ldots x_{n-1}$. This prequential approach [Daw84] skips over the intermediate
step of learning a model based on the observed data $x_1\ldots x_{n-1}$ and then using this model
to predict $x_n$. The prequential approach avoids problems of model consistency,
how to separate noise from useful data, and many other issues. The goal is to
make "good" predictions, where the prediction quality is usually measured by a
loss function, which shall be minimized. The key concept to well-define and solve
induction problems is Occam's razor (simplicity) principle, which says that "Entities
should not be multiplied beyond necessity", which may be interpreted as to keep the
simplest theory consistent with the observations $x_1\ldots x_{n-1}$ and to use this theory to
predict $x_n$. Before we can present Solomonoff's formal solution, we have to quantify
Occam's razor in terms of Kolmogorov complexity, and introduce the notion of
subjective/objective probabilities.

3.2 Algorithmic Information Theory 

Intuitively, a string is simple if it can be described in a few words, like "the string 
of one million ones", and is complex if there is no such short description, like for 
a random string whose shortest description is specifying it bit by bit. We can 
restrict the discussion to binary strings, since for other (non-stringy mathematical) 
objects we may assume some default coding as binary strings. Furthermore, we are 
only interested in effective descriptions, and hence restrict decoders to be Turing 
machines. Let us choose some universal (so-called prefix) Turing machine $U$ with
unidirectional binary input and output tapes and a bidirectional work tape [LV97,
Hut04]. We can then define the (conditional) prefix Kolmogorov complexity [Cha75,
Gac74, Kol65, Lev74] of a binary string $x$ as the length $\ell$ of the shortest program $p$
for which $U$ outputs the binary string $x$ (given $y$).

Definition 7 (Kolmogorov complexity) Let $U$ be a universal prefix Turing machine.
The (conditional) prefix Kolmogorov complexity is defined as the length of the shortest
program $p$ for which $U$ outputs $x$ (given $y$):

$$K(x) := \min_p\{\ell(p) : U(p) = x\}, \qquad K(x|y) := \min_p\{\ell(p) : U(y,p) = x\}$$

Simple strings like 000. ..0 can be generated by short programs, and hence have low 
Kolmogorov complexity, but irregular (e.g. random) strings are their own shortest 
description, and hence have high Kolmogorov complexity. An important property 
of K is that it is nearly independent of the choice of U. Furthermore, it shares many 
properties with Shannon's entropy (information measure) S, but K is superior to 
S in many respects. To be brief, K is an excellent universal complexity measure, 
suitable for quantifying Occam's razor. There is (only) one severe disadvantage: K 
is not finitely computable. The major algorithmic property of K is that it is (only) 
co-enumerable, i.e. it is approximable from above. 
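
Since $K$ is only approximable from above, a fixed lossless compressor is sometimes used as a crude, computable stand-in. The sketch below assumes `zlib` in that role; the function name `compressed_bits` is hypothetical. The compressed length bounds the description length only relative to that particular decompressor, so it says nothing precise about $K$ itself, but it reproduces the qualitative contrast between regular and random strings.

```python
import os
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding of data (a computable proxy, not K)."""
    return 8 * len(zlib.compress(data, 9))


regular = b"0" * 100_000            # analogue of "the string of one million ones"
random_like = os.urandom(100_000)   # incompressible with overwhelming probability

print("regular:    ", compressed_bits(regular), "bits")
print("random-like:", compressed_bits(random_like), "bits")
```

The regular string shrinks to a small fraction of its raw length, while the random-looking string does not shrink at all, matching the intuition that random strings are their own shortest description.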

For general (non-string) objects one can specify some default coding $\langle\cdot\rangle$ and
define $K(object) := K(\langle object\rangle)$, especially for numbers and pairs, e.g. we abbreviate
$K(x,y) := K(\langle x,y\rangle)$. The most important information-theoretic properties of $K$ are
listed below, where we abbreviate $f(x) \le g(x) + O(1)$ by $f(x) \stackrel{+}{\le} g(x)$. We also later
abbreviate $f(x) = O(g(x))$ by $f(x) \stackrel{\times}{\le} g(x)$.

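Theorem 8 (Information properties of Kolmogorov complexity)

i) $K(x) \stackrel{+}{\le} \ell(x) + 2\log\ell(x)$, $\quad K(n) \stackrel{+}{\le} \log n + 2\log\log n$

ii) $\sum_x 2^{-K(x)} \le 1$, $\quad K(x) \ge \ell(x)$ for 'most' $x$, $\quad K(n) \to \infty$ for $n \to \infty$

iii) $K(x|y) \stackrel{+}{\le} K(x) \stackrel{+}{\le} K(x,y)$

iv) $K(x,y) \stackrel{+}{\le} K(x) + K(y)$, $\quad K(xy) \stackrel{+}{\le} K(x) + K(y)$

v) $K(x|y,K(y)) + K(y) \stackrel{+}{=} K(x,y) \stackrel{+}{=} K(y,x) \stackrel{+}{=} K(y|x,K(x)) + K(x)$

vi) $K(f(x)) \stackrel{+}{\le} K(x) + K(f)$ if $f: \mathbb{B}^* \to \mathbb{B}^*$ is recursive/computable

vii) $K(x) \stackrel{+}{\le} -\log_2 P(x) + K(P)$ if $P: \mathbb{B}^* \to [0,1]$ is recursive and $\sum_x P(x) \le 1$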

All (in)equalities remain valid if $K$ is (further) conditioned under some $z$, i.e.
$K(\ldots) \leadsto K(\ldots|z)$ and $K(\ldots|y) \leadsto K(\ldots|y,z)$. Those stated are all valid within an
additive constant of size $O(1)$, but there are others which are only valid to logarithmic
accuracy. $K$ has many properties in common with Shannon entropy, as it should
be, since both measure the information content of a string. Property (i) gives an upper
bound on $K$, and property (ii) is Kraft's inequality, which implies a lower bound
on $K$ valid for 'most' $n$, where 'most' means that there are only $o(N)$ exceptions for
$n \in \{1,\ldots,N\}$. Providing side information $y$ can never increase code length, requiring
extra information $y$ can never decrease code length (iii). Coding $x$ and $y$ separately
never helps (iv), and transforming $x$ does not increase its information content (vi).
Property (vi) also shows that if $x$ codes some object $o$, switching from one coding
scheme to another by means of a recursive bijection leaves $K$ unchanged within
additive $O(1)$ terms. The first nontrivial result is the symmetry of information (v),
which is the analogue of the multiplication/chain rule for conditional probabilities.
Property (vii) is at the heart of the MDL principle [Ris89], which approximates
$K(x)$ by $-\log_2 P(x) + K(P)$. See [LV97] for proofs.
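
The shape of the upper bound $K(n) \stackrel{+}{\le} \log n + 2\log\log n$ in (i) and Kraft's inequality in (ii) can be made concrete with an explicit prefix code. The sketch below uses the Elias delta code as an assumed example: it shows that a prefix-free code with these lengths exists, and does not compute $K$. All function names are hypothetical.

```python
from math import log2


def elias_gamma(n: int) -> str:
    """Prefix code for n >= 1: (bit length - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b


def elias_delta(n: int) -> str:
    """Prefix code for n >= 1 of length about log n + 2 log log n bits."""
    b = bin(n)[2:]
    # Gamma-code the bit length of n, then append n without its leading 1.
    return elias_gamma(len(b)) + b[1:]


def is_prefix_free(codewords) -> bool:
    """In sorted order, a codeword that is a prefix of another is followed by one."""
    ordered = sorted(codewords)
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def kraft_sum(n_max: int) -> float:
    """Sum of 2^-length over the codewords of 1..n_max; at most 1 for a prefix code."""
    return sum(2.0 ** -len(elias_delta(n)) for n in range(1, n_max + 1))


codes = [elias_delta(n) for n in range(1, 2000)]
print(is_prefix_free(codes), kraft_sum(4095))
# Code length versus the bound log n + 2 log(log n + 1) + 1 for a sample n.
n = 1000
print(len(elias_delta(n)), log2(n) + 2 * log2(log2(n) + 1) + 1)
```

Because the shortest programs of a prefix machine also form a prefix-free set, the same Kraft argument yields $\sum_x 2^{-K(x)} \le 1$.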

3.3 Uncertainty & Probabilities

For the objectivist probabilities are real aspects of the world. The outcome of an 
observation or an experiment is not deterministic, but involves physical random 
processes. Kolmogorov's axioms of probability theory formalize the properties that 
probabilities should have. In the case of i.i.d. experiments the probabilities assigned 
to events can be interpreted as limiting frequencies {frequentist view), but appli- 
cations are not limited to this case. Conditionalizing probabilities and Bayes' rule 
are the major tools in computing posterior probabilities from prior ones. For instance,
given the initial binary sequence $x_1\ldots x_{n-1}$, what is the probability of the
next bit being 1? The probability of observing $x_n$ at time $n$, given past observations
$x_1\ldots x_{n-1}$, can be computed with the multiplication or chain rule* if the true generating
distribution $\mu$ of the sequences $x_1x_2x_3\ldots$ is known: $\mu(x_{<n}\underline{x}_n) = \mu(\underline{x}_{1:n})/\mu(\underline{x}_{<n})$
(see Sections 2.2 and 2.5). The problem, however, is that one often does not know
the true distribution (e.g. in the cases of weather and stock-market forecasting).

The subjectivist uses probabilities to characterize an agent's degree of belief in (or
plausibility of) something, rather than to characterize physical random processes.
This is the most relevant interpretation of probabilities in AI. It is somewhat surprising
that plausibilities can be shown to also respect Kolmogorov's axioms of
probability and the chain rule for conditional probabilities by assuming only a few
plausible qualitative rules they should follow [Cox46]. Hence, if the plausibility of
$x_{1:n}$ is $\xi(\underline{x}_{1:n})$, the degree of belief in $x_n$ given $x_{<n}$ is, again, given by the conditional
probability: $\xi(x_{<n}\underline{x}_n) = \xi(\underline{x}_{1:n})/\xi(\underline{x}_{<n})$.

The chain rule allows determining posterior probabilities/plausibilities from
prior ones, but leaves open the question of how to determine the priors themselves. 
In statistical physics, the principle of indifference (symmetry principle) and the max- 
imum entropy principle can often be exploited to determine prior probabilities, but 
only Occam's razor is general enough to assign prior probabilities in every situation, 
especially to cope with complex domains typical for AI. 



* Strictly speaking it is just the definition of conditional probabilities.

3.4 Algorithmic Probability & Universal Induction

Occam's razor (appropriately interpreted and in compromise with Epicurus' principle
of indifference) tells us to assign high/low a priori plausibility to simple/complex
strings $x$. Using $K$ as the complexity measure, any monotone decreasing function
of $K$, e.g. $\xi(x) = 2^{-K(x)}$, would satisfy this criterion. But $\xi$ also has to satisfy the
probability axioms, so we have to be a bit more careful. Solomonoff [Sol64, Sol78]
defined the universal prior $\xi(x)$ as the probability that the output of a universal
Turing machine $U$ starts with $x$ when provided with fair coin flips on the input
tape. Formally, $\xi$ can be defined as

$$\xi(x) := \sum_{p\,:\,U(p)=x*} 2^{-\ell(p)} \ge 2^{-K(x)} \qquad (16)$$

where the sum is over all (so-called minimal) programs $p$ for which $U$ outputs a
string starting with $x$. The inequality follows by dropping all terms in $\sum_p$ except
for the shortest $p$ computing $x$. Strictly speaking $\xi$ is only a semimeasure since it
is not normalized to 1, but this is acceptable/correctable. We derive the following
bound:

$$\sum_{t=1}^{\infty} \big(1-\xi(x_{<t}\underline{x}_t)\big)^2 \le -\tfrac12 \sum_{t=1}^{\infty} \ln\xi(x_{<t}\underline{x}_t) = -\tfrac12 \ln\xi(\underline{x}_{1:\infty}) \le \tfrac12 \ln 2\cdot K(x_{1:\infty})$$

In the first inequality we have used $(1-a)^2 \le -\tfrac12\ln a$ for $0 \le a \le 1$. In the equality we
exchanged the sum with the logarithm and eliminated the resulting product by the
chain rule (6). In the last inequality we used (16). If $x_{1:\infty}$ is a computable sequence,
then $K(x_{1:\infty})$ is finite, which implies $\xi(x_{<t}\underline{x}_t) \to 1$ (since $\sum_{t=1}^{\infty}(1-a_t)^2 < \infty \Rightarrow a_t \to 1$).
This means that if the environment is a computable sequence (whichsoever, e.g.
the digits of $\pi$ or $e$ in binary representation), after having seen the first few digits, $\xi$
correctly predicts the next digit with high probability, i.e. it recognizes the structure
of the sequence.

Assume now that the true sequence is drawn from the distribution $\mu$, i.e. the true
(objective) probability of $x_{1:n}$ is $\mu(\underline{x}_{1:n})$, but $\mu$ is unknown. How is the posterior
(subjective) belief $\xi(x_{<n}\underline{x}_n) = \xi(\underline{x}_{1:n})/\xi(\underline{x}_{<n})$ related to the true (objective) posterior
probability $\mu(x_{<n}\underline{x}_n)$? Solomonoff's [Sol78] crucial result is that the posterior (subjective)
beliefs converge to the true (objective) posterior probabilities, if the latter
are computable. More precisely, he showed that

$$\sum_{t=1}^{\infty} \sum_{x_{<t}} \mu(\underline{x}_{<t}) \big(\xi(x_{<t}\underline{0}) - \mu(x_{<t}\underline{0})\big)^2 \stackrel{+}{\le} \tfrac12 \ln 2\cdot K(\mu) \qquad (17)$$



$K(\mu)$ is finite if $\mu$ is computable, but the infinite sum on the l.h.s. can only be finite
if the difference $\xi(x_{<t}\underline{0}) - \mu(x_{<t}\underline{0})$ tends to zero for $t \to \infty$ with $\mu$-probability 1. This
shows that using $\xi$ as an estimate for $\mu$ may be a reasonable thing to do.
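
The universal $\xi$ itself is not computable, but a bound of the same form as (17) can be checked numerically for a finite Bayes mixture. The sketch below is a toy analogue under stated assumptions: a hypothetical class of five Bernoulli environments, with prior weights $2^{-(i+1)}$ standing in for $2^{-K(\nu)}$, so that the right-hand side becomes $\tfrac12\ln(1/w_\mu)$. The function names are illustrative.

```python
from math import comb, log


def mixture_prob_zero(thetas, weights, ones, zeros):
    """Posterior predictive probability that the next bit is 0.

    thetas[i] is the probability of a 1 under the i-th Bernoulli environment,
    weights[i] its prior weight; ones/zeros are the counts observed so far.
    """
    posterior = [w * th**ones * (1 - th)**zeros for th, w in zip(thetas, weights)]
    return sum(p * (1 - th) for p, th in zip(posterior, thetas)) / sum(posterior)


def expected_squared_error(thetas, weights, true_index, n):
    """mu-expected sum over t = 1..n of (xi(0 | x_<t) - mu(0 | x_<t))^2."""
    th_true = thetas[true_index]
    total = 0.0
    for t in range(1, n + 1):
        for ones in range(t):  # group the histories x_<t by their number of ones
            zeros = t - 1 - ones
            mu_of_group = comb(t - 1, ones) * th_true**ones * (1 - th_true)**zeros
            diff = mixture_prob_zero(thetas, weights, ones, zeros) - (1 - th_true)
            total += mu_of_group * diff**2
    return total


# Hypothetical class and prior: weight 2^-(i+1) plays the role of 2^-K(nu).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [2.0 ** -(i + 1) for i in range(len(thetas))]
true_index = 3
lhs = expected_squared_error(thetas, weights, true_index, n=100)
rhs = 0.5 * log(1 / weights[true_index])  # (ln 2 / 2) * 4 for weight 2^-4
print(lhs, rhs)
```

The left-hand side stays below the bound however large `n` is chosen, which is the finite-class counterpart of the statement that the squared prediction differences are summable.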
3.5 Loss Bounds & Pareto Optimality

Most predictions are eventually used as a basis for some decision or action, which
itself leads to some reward or loss. Let $\ell_{x_t y_t} \in [0,1] \subset \mathbb{R}$ be the received loss when
performing prediction/decision/action $y_t \in \mathcal{Y}$ and $x_t \in \mathcal{X}$ is the $t$-th symbol of the
sequence. Let $y_t^\Lambda \in \mathcal{Y}$ be the prediction of a (causal) prediction scheme $\Lambda$. The true
probability of the next symbol being $x_t$, given $x_{<t}$, is $\mu(x_{<t}\underline{x}_t)$. The expected loss
when predicting $y_t$ is $E[\ell_{x_t y_t}]$. The total $\mu$-expected loss suffered by the $\Lambda$ scheme
in the first $n$ predictions is

$$L_n^\Lambda := \sum_{t=1}^{n} E[\ell_{x_t y_t^\Lambda}] = \sum_{t=1}^{n} \sum_{x_{1:t}\in\mathcal{X}^t} \mu(\underline{x}_{1:t})\,\ell_{x_t y_t^\Lambda} \qquad (18)$$

For instance, for the error-loss $\ell_{xy} = 1$ if $x \ne y$ and 0 else, $L_n^\Lambda$ is the expected number
of prediction errors, which we denote by $E_n^\Lambda$. The goal is to minimize the expected
loss. More generally, we define the $\Lambda_\rho$ sequence prediction scheme (later also called
SPρ) $y_t^{\Lambda_\rho} := \arg\min_{y_t\in\mathcal{Y}} \sum_{x_t} \rho(x_{<t}\underline{x}_t)\,\ell_{x_t y_t}$, which minimizes the $\rho$-expected loss. If $\mu$ is
known, $\Lambda_\mu$ is obviously the best prediction scheme in the sense of achieving minimal
expected loss ($L_n^{\Lambda_\mu} \le L_n^\Lambda$ for any $\Lambda$). One can prove the following loss bound for the
universal $\Lambda_\xi$ predictor [Hut01b, Hut01a, Hut03a]

$$0 \le L_n^{\Lambda_\xi} - L_n^{\Lambda_\mu} \le 2\ln 2\cdot K(\mu) + 2\sqrt{L_n^{\Lambda_\mu}\,\ln 2\cdot K(\mu)} \qquad (19)$$

Together with $L_n \le n$ this shows that $\frac1n L_n^{\Lambda_\xi} - \frac1n L_n^{\Lambda_\mu} = O(n^{-1/2})$, i.e. asymptotically
$\Lambda_\xi$ achieves the optimal average loss of $\Lambda_\mu$ with rapid convergence. Moreover $L_\infty^{\Lambda_\xi}$ is
finite if $L_\infty^{\Lambda_\mu}$ is finite, and $L_n^{\Lambda_\xi}/L_n^{\Lambda_\mu} \to 1$ if $L_\infty^{\Lambda_\mu}$ is not finite. Bound (19) also implies
$L_n^\Lambda \ge L_n^{\Lambda_\xi} - 2\sqrt{L_n^{\Lambda_\xi}\,\ln 2\cdot K(\mu)}$, which shows that no (causal) predictor $\Lambda$ whatsoever
achieves significantly less (expected) loss than $\Lambda_\xi$. In view of these results it is fair
to say that, ignoring computational issues, the problem of sequence prediction has
been solved in a universal way.
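
The decision rule of the $\Lambda_\rho$ scheme is a one-step minimization of expected loss and can be sketched directly. In the sketch below the predictive distribution $\rho(x_{<t}\underline{x}_t)$ is simply passed in as a dictionary; the function name, the weather symbols and the loss values are hypothetical and serve only as an illustration.

```python
def bayes_optimal_prediction(rho, loss, actions):
    """Return argmin over y of sum_x rho[x] * loss[x][y].

    rho maps each symbol x to the predictive probability of x_t = x given x_<t;
    loss[x][y] in [0, 1] is the loss of predicting/deciding y when x occurs.
    """
    def expected_loss(y):
        return sum(p * loss[x][y] for x, p in rho.items())
    return min(actions, key=expected_loss)


# Error loss: predicting the wrong symbol costs 1, so the most likely symbol wins.
rho = {"rain": 0.3, "sun": 0.7}
error_loss = {x: {y: 0.0 if x == y else 1.0 for y in rho} for x in rho}
print(bayes_optimal_prediction(rho, error_loss, actions=list(rho)))

# Asymmetric loss: getting wet is costly, carrying an umbrella is cheap.
umbrella_loss = {"rain": {"umbrella": 0.0, "none": 1.0},
                 "sun": {"umbrella": 0.1, "none": 0.0}}
print(bayes_optimal_prediction(rho, umbrella_loss, actions=["umbrella", "none"]))
```

Under the error loss the rule reduces to predicting the most probable symbol, while a different loss function can favour an action tied to the less probable symbol.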

A different kind of optimality is Pareto optimality. The universal prior $\xi$ is Pareto
optimal in the sense that there is no other predictor that leads to equal or smaller
loss in all environments. Any improvement achieved by some predictor $\Lambda$ over $\Lambda_\xi$ in
some environments is balanced by a deterioration in other environments [Hut03c].



4 The Universal Algorithmic Agent AIXI 

Active systems, like game playing (SG) and optimization (FM), cannot be reduced
to induction systems. The main idea of this work is to generalize universal induction
to the general agent model described in Section 2. For this, we generalize $\xi$ to include
actions as conditions and replace $\mu$ by $\xi$ in the rational agent model, resulting in
the AIξ (= AIXI) model. In this way the problem that the true prior probability $\mu$ is
usually unknown is solved. Convergence of $\xi \to \mu$ can be shown, indicating that the
AIξ model could behave optimally in any computable but unknown environment
with reinforcement feedback.

The main focus of this section is to investigate what we can expect from a
universally optimal agent and to clarify the meanings of universal, optimal, etc.
Unfortunately bounds similar to the loss bound (19) in the SP case can hold for
no active agent. This forces us to lower our expectation about universally optimal
agents and to introduce other (weaker) performance measures. Finally, we show
that AIξ is Pareto optimal in the sense that there is no other policy yielding higher
or equal value in all environments and a strictly higher value in at least one.



4.1 The Universal AIξ Model

Definition of the AIξ model. We have developed enough formalism to suggest
our universal AIξ model. All we have to do is to suitably generalize the universal
semimeasure $\xi$ from the last section and replace the true but unknown prior probability
$\mu^{AI}$ in the AIμ model by this generalized $\xi^{AI}$. In what sense this AIξ model
is universal will be discussed subsequently.

In the functional formulation we define the universal probability $\xi^{AI}$ of an environment
$q$ just as $2^{-\ell(q)}$:

$$\xi(q) := 2^{-\ell(q)}$$

The definition could not be easier!* ** Collecting the formulas of Section 2.4 and
replacing $\mu(q)$ by $\xi(q)$ we get the definition of the AIξ agent in functional form.
Given the history $\dot y\dot x_{<k}$ the policy $p^\xi$ of the functional AIξ agent is given by

$$\dot y_k := \arg\max_{y_k} \max_{p\,:\,p(\dot x_{<k})=\dot y_{<k}y_k} \sum_{q\,:\,q(\dot y_{<k})=\dot x_{<k}} 2^{-\ell(q)}\cdot V^{pq}_{k m_k} \qquad (20)$$

in cycle $k$, where $V^{pq}_{k m_k}$ is the total reward of cycles $k$ to $m_k$ when agent $p$ interacts
with environment $q$. We have dropped the denominator $\sum_q \mu(q)$ from (2) as it is
independent of the $p \in \dot P_k$ and a constant multiplicative factor does not change
$\arg\max_{y_k}$.

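For the iterative formulation, the universal probability $\xi$ can be obtained by
inserting the functional $\xi(q)$ into (13):

$$\xi(y\underline{x}_{1:k}) = \sum_{q\,:\,q(y_{1:k})=x_{1:k}} 2^{-\ell(q)} \qquad (21)$$

Replacing $\mu$ by $\xi$ in (11) the iterative AIξ agent outputs

$$\dot y_k \equiv \dot y_k^\xi := \arg\max_{y_k} \sum_{x_k} \max_{y_{k+1}} \sum_{x_{k+1}} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} \big(r(x_k) + \ldots + r(x_{m_k})\big)\cdot\xi(\dot y\dot x_{<k}\, y\underline{x}_{k:m_k}) \qquad (22)$$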



* It is not necessary to use $2^{-K(q)}$ or something similar as some readers may expect at this point,
because for every program $q$ there exists a functionally equivalent program $\tilde q$ with $K(\tilde q) \stackrel{+}{=} \ell(\tilde q)$.

** Here and later we identify objects with their coding relative to some fixed Turing machine
$U$. For example, if $q$ is a function, $K(q) := K(\langle q\rangle)$ with $\langle q\rangle$ being a binary coding of $q$ such that
$U(\langle q\rangle, y) = q(y)$. Reversely, if $q$ already is a binary string we define $q(y) := U(q,y)$.






in cycle $k$ given the history $\dot y\dot x_{<k}$.
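
The alternating max/sum structure of the iterative agent can be sketched for a toy setting. The code below is not AIXI: it assumes a small finite class of hypothetical environments with explicit prior weights in place of the sum over all programs weighted by $2^{-\ell(q)}$, and it evaluates the expectimax expression in its recursive form, using the chain rule to turn the joint mixture probability into conditional ones. All names are illustrative.

```python
def expectimax(envs, weights, history, steps_left, actions, percepts, reward):
    """Return (value, action) maximising the mixture-expected reward sum.

    envs[i](history, y, x) is the probability that environment i emits percept x
    after the given history and action y; weights[i] is its unnormalised
    posterior weight, i.e. prior weight times the likelihood of the history.
    """
    if steps_left == 0:
        return 0.0, None
    total = sum(weights)
    best_value, best_action = float("-inf"), None
    for y in actions:
        value = 0.0
        for x in percepts:
            new_weights = [w * env(history, y, x) for w, env in zip(weights, envs)]
            prob = sum(new_weights) / total  # mixture probability of x given history, y
            if prob > 0.0:
                future, _ = expectimax(envs, new_weights, history + ((y, x),),
                                       steps_left - 1, actions, percepts, reward)
                value += prob * (reward(x) + future)
        if value > best_value:
            best_value, best_action = value, y
    return best_value, best_action


# Two hypothetical deterministic environments: env_a pays reward 1 for action 0,
# env_b pays reward 1 for action 1. The percept is the reward itself.
def env_a(history, y, x):
    return 1.0 if x == (1 if y == 0 else 0) else 0.0


def env_b(history, y, x):
    return 1.0 if x == (1 if y == 1 else 0) else 0.0


envs, prior = [env_a, env_b], [0.5, 0.5]
value, action = expectimax(envs, prior, (), 3, actions=(0, 1), percepts=(0, 1),
                           reward=lambda x: x)
print(value, action)
```

In this toy class the first percept reveals which environment is active, so the mixture agent forgoes reward only in the first cycle; an agent that knew the environment in advance would collect the full reward of 3 over three cycles.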

The equivalence of the functional and iterative AI model (Theorem 5) is true for
every chronological semimeasure $\rho$, especially for $\xi$, hence we can talk about the AIξ
model in this respect. It (slightly) depends on the choice of the universal Turing
machine: $\ell(\langle q\rangle)$ is defined only up to an additive constant. The AIξ model also
depends on the choice of $\mathcal{X} = \mathcal{R}\times\mathcal{O}$ and $\mathcal{Y}$, but we do not expect any bias when
the spaces are chosen sufficiently simple, e.g. all strings of length $2^{16}$. Choosing $\mathbb{N}$
as the word space would be ideal, but whether the maxima (suprema) exist in this
case has to be shown beforehand. The only nontrivial dependence is on the horizon
function $m_k$, which will be discussed later. So apart from $m_k$ and unimportant
details the AIξ agent is uniquely defined by (20) or (22). It does not depend on any
assumption about the environment apart from being generated by some computable
(but unknown!) probability distribution.

Convergence of $\xi$ to $\mu$. Similarly to (17) one can show that the $\mu$-expected
squared difference of $\mu$ and $\xi$ is finite for computable $\mu$. This, in turn, shows that
$\xi(yx_{<k}y\underline{x}_k)$ converges rapidly to $\mu(yx_{<k}y\underline{x}_k)$ for $k \to \infty$ with $\mu$-probability 1. The
line of reasoning is the same; the $y$ are pure spectators. This will change when
we analyze loss/reward bounds analogous to (19). More generally, one can show
[Hut04] that*

$$\xi(yx_{<k}\,y\underline{x}_{k:m_k}) \;\stackrel{k\to\infty}{\longrightarrow}\; \mu(yx_{<k}\,y\underline{x}_{k:m_k}) \qquad (23)$$

This gives hope that the outputs $\dot y_k$ of the AIξ model (22) could converge to the
outputs $\dot y_k$ from the AIμ model (11).

We want to call an AI model universal, if it is /i-independent (unbiased, model- 
free) and is able to solve any solvable problem and learn any learnable task. Further, 
we call a universal model, universally optimal, if there is no program, which can solve 
or learn significantly faster (in terms of interaction cycles). Indeed, the AIξ model
is parameter free, $\xi$ converges to $\mu$ (23), the AIμ model is itself optimal, and we
expect no other model to converge faster to AIμ by analogy to SP (19):

Claim 9 (We expect AIXI to be universally optimal) 

This is our main claim. In a sense, the intention of the remaining sections is to 
define this statement more rigorously and to give further support. 

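Intelligence order relation. We define the $\xi$-expected reward in cycles $k$ to $m$ of
a policy $p$ similar to (2) and (20). We extend the definition to programs $p \notin \dot P_k$ that
are not consistent with the current history.

$$V^{p\xi}_{km}(\dot y\dot x_{<k}) := \frac{1}{\mathcal{N}} \sum_{q\,:\,q(\dot y_{<k})=\dot x_{<k}} 2^{-\ell(q)}\cdot V^{\tilde p q}_{km} \qquad (24)$$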

* Here, and everywhere else, with $\xi_k \to \mu_k$ we mean $\xi_k - \mu_k \to 0$, and not that $\mu_k$ (and $\xi_k$) itself
converges to a limiting value.






Marcus Hutter, Technical Report, IDSIA-01-03 



The normalization $\mathcal{N}$ is again only necessary for interpreting $V_{km}$ as the expected
reward but is otherwise unneeded. For consistent policies $p \in \dot P_k$ we define $\tilde p := p$. For
$p \notin \dot P_k$, $\tilde p$ is a modification of $p$ in such a way that its outputs are consistent with the
current history $\dot y\dot x_{<k}$, hence $\tilde p \in \dot P_k$, but unaltered for the current and future cycles
$\ge k$. Using this definition of $V_{km}$ we could take the maximum over all policies $p$ in
(20), rather than only the consistent ones.

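Definition 10 (Intelligence order relation) We call a policy $p$ more or equally
intelligent than $p'$ and write

$$p \succeq p' \;:\Leftrightarrow\; \forall k\,\forall \dot y\dot x_{<k} : V^{p\xi}_{k m_k}(\dot y\dot x_{<k}) \ge V^{p'\xi}_{k m_k}(\dot y\dot x_{<k}),$$

i.e. if $p$ yields in any circumstance at least as high a $\xi$-expected reward as $p'$.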

As the algorithm $p^\xi$ behind the AIξ agent maximizes $V^{p\xi}_{k m_k}$ we have $p^\xi \succeq p$ for all $p$.
The AIξ model is hence the most intelligent agent w.r.t. $\succeq$. Relation $\succeq$ is a universal
order relation in the sense that it is free of any parameters (except $m_k$) or specific
assumptions about the environment. A proof that $\succeq$ is a reliable intelligence order
(which we believe to be true) would prove that AIξ is universally optimal. We
could further ask: How useful is $\succeq$ for ordering policies of practical interest with
intermediate intelligence, or how can $\succeq$ help to guide toward constructing more
intelligent systems with reasonable computation time? An effective intelligence order
relation $\succeq^c$ will be defined in Section 6, which is more useful from a practical point
of view.

4.2 On the Optimality of AIXI 

In this section we outline ways toward an optimality proof of AIXI. Sources of
inspiration are the SP loss bounds proven in Section 3 and optimality criteria from
the adaptive control literature (mainly) for linear systems [KV86]. The value bounds
for AIXI are expected to be, in a sense, weaker than the SP loss bounds because the
problem class covered by AIXI is much larger than the class of induction problems.
Convergence of $\xi$ to $\mu$ has already been proven, but is not sufficient to establish
convergence of the behavior of the AIXI model to the behavior of the AIμ model.
We will focus on three approaches toward a general optimality proof:

What is meant by universal optimality. The first step is to investigate what we
can expect from AIXI, i.e. what is meant by universal optimality. A "learner" (like
AIXI) may converge to the optimal informed decision-maker (like AIμ) in several
senses. Possibly relevant concepts from statistics are consistency, self-tunability,
self-optimization, efficiency, unbiasedness, asymptotic or finite convergence [KV86],
Pareto optimality, and some more defined in Section 4.3. Some concepts are stronger
than necessary, others are weaker than desirable but suitable to start with.
Self-optimization is defined as the asymptotic convergence of the average true value
$\frac1m V_{1m}^{p^\xi\mu}$ of AIξ to the optimal value $\frac1m V_{1m}^{*\mu}$. Apart from convergence speed,
self-optimization of AIXI would most closely correspond to the loss bounds proven for
SP. We investigate which properties are desirable and under which circumstances
the AIXI model satisfies these properties. We will show that no universal model,
including AIXI, can in general be self-optimizing. On the other hand, we show that
AIXI is Pareto optimal in the sense that there is no other policy that performs better
or equal in all environments, and strictly better in at least one.

Limited environmental classes. The problem of defining and proving general 
value bounds becomes more feasible by considering, in a first step, restricted con- 
cept classes. We analyze AIXI for known classes (like Markovian or factorizable 
environments) and especially for the new classes (forgetful, relevant, asymptotically 
learnable, farsighted, uniform, pseudo-passive, and passive) defined later in
Section 4.3. In Section 5 we study the behavior of AIXI in various standard problem
classes, including sequence prediction, strategic games, function minimization, and 
supervised learning. 

Generalization of AIXI to general Bayes mixtures. The other approach is
to generalize AIXI to AI$\xi$, where $\xi(\cdot) = \sum_{\nu\in\mathcal{M}} w_\nu\,\nu(\cdot)$ is a general Bayes
mixture of distributions $\nu$ in some class $\mathcal{M}$. If $\mathcal{M}$ is the multi-set of enumerable
semimeasures enumerated by a Turing machine, then AI$\xi$ coincides with AIXI. If $\mathcal{M}$ is the
(multi)set of passive effective environments, then AIXI reduces to the $\Lambda_\xi$ predictor
that has been shown to perform well. One can show that these loss/value bounds
generalize to wider classes, at least asymptotically [Hut02b]. Promising classes are,
again, the ones described in Section 4.3. In particular, for ergodic mdps we showed
that AI$\xi$ is self-optimizing. Obviously, the least we must demand from $\mathcal{M}$ to have
a chance of finding a self-optimizing policy is that there exists some self-optimizing
policy at all. The key result in [Hut02b] is that this necessary condition is also
sufficient. More generally, the key is not to prove absolute results for specific problem
classes, but to prove relative results of the form "if there exists a policy with certain
desirable properties, then AI$\xi$ also possesses these desirable properties". If there
are tasks that cannot be solved by any policy, AI$\xi$ cannot be blamed for failing.
Environmental classes that allow for self-optimizing policies include bandits, i.i.d.
processes, classification tasks, certain classes of pomdps, $k^{th}$-order ergodic mdps,
factorizable environments, repeated games, and prediction problems. Note that in
this approach we have for each environmental class a corresponding model AI$\xi$,
whereas in the approach pursued in this article the same universal AIXI model is
analyzed for all environmental classes.
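To make the notion of a Bayes mixture concrete, here is a minimal sketch for a finite class. It is not taken from the source: the class of three i.i.d. Bernoulli environments, the uniform prior weights, and all function names are assumptions chosen purely for illustration. It computes $\xi$ and the posterior weights after some observations.

```python
from fractions import Fraction

def likelihood(theta, bits):
    """nu(bits) for an i.i.d. Bernoulli(theta) environment nu."""
    result = Fraction(1)
    for b in bits:
        result *= theta if b == 1 else 1 - theta
    return result

def mixture(bits, thetas, weights):
    """xi(bits) = sum over nu in M of w_nu * nu(bits)."""
    return sum(w * likelihood(t, bits) for t, w in zip(thetas, weights))

def posterior(bits, thetas, weights):
    """Posterior weights w_nu * nu(bits) / xi(bits)."""
    xi = mixture(bits, thetas, weights)
    return [w * likelihood(t, bits) / xi for t, w in zip(thetas, weights)]

# Hypothetical finite class M and prior weights (assumptions, not from the text).
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
weights = [Fraction(1, 3)] * 3
data = [1, 1, 0, 1, 1, 1, 0, 1]
post = posterior(data, thetas, weights)
```

In this toy class the posterior mass shifts toward the environment that best explains the data, and $\xi$ dominates every weighted component $w_\nu\,\nu$, which is the property the mixture approach relies on.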

Optimality by construction. A possible further approach toward an optimality
"proof" is to regard AIXI as optimal by construction. This perspective is common
in various (simpler) settings. For instance, in bandit problems, where pulling arm $i$
leads to reward 1 (0) with unknown probability $p_i$ ($1-p_i$), the traditional Bayesian
solution to the uncertainty about $p_i$ is to assume a uniform (or Beta) prior over
$p_i$ and to maximize the (subjectively) expected reward sum over multiple trials.
The exact solution (in terms of Gittins indices) is widely regarded as "optimal",
although justified alternative approaches exist. Similarly, but simpler, assuming
a uniform subjective prior over the Bernoulli parameter $p \in [0,1]$, one arrives at
the reasonable, but more controversial, Laplace rule for predicting i.i.d. sequences.
AIXI is similar in the sense that the unknown $\mu \in \mathcal{M}$ is the analogue of the unknown
$p \in [0,1]$, and the prior beliefs $w_\nu = 2^{-K(\nu)}$ justified by Occam's razor are the analogue
of a uniform distribution over $[0,1]$. In the same sense as Gittins' solution to the
bandit problem and Laplace's rule for Bernoulli sequences, AIXI may also be regarded
as optimal by construction. Theorems relating AIXI to AI$\mu$ would then not be regarded
as optimality proofs of AIXI, but just as statements of how much harder it becomes to
operate when $\mu$ is unknown, i.e. the achievements of the first three approaches are
simply reinterpreted.
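As a small check of the Laplace rule mentioned above, the following sketch compares the closed form $(n_1+1)/(n+2)$ with the Bayesian predictive probability under a uniform prior, computed from exact Beta integrals. The function names are illustrative assumptions and not part of the source.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact value of the integral of p^a (1-p)^b over [0,1]: a! b! / (a+b+1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def laplace_rule(ones, zeros):
    """Laplace's rule: probability that the next bit is 1."""
    return Fraction(ones + 1, ones + zeros + 2)

def bayes_predictive(ones, zeros):
    """Predictive probability of a 1 under a uniform prior over p."""
    return beta_integral(ones + 1, zeros) / beta_integral(ones, zeros)
```

For example, after three ones and one zero both functions give 2/3. The uniform prior plays the role that the weights $2^{-K(\nu)}$ play in AIXI.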

4.3 Value Bounds and Separability Concepts 

Introduction. The values $V_{km}$ associated with the AI systems correspond roughly
to the negative loss $-L^{\Lambda_\mu}$ of the SP systems. In SP, we were interested in small
bounds for the loss excess $L^{\Lambda_\xi} - L^{\Lambda_\mu}$. Unfortunately, simple value bounds for AI$\xi$ in
terms of $V_{km}$ analogous to the SP loss bound do not hold. We even have difficulties
in specifying what we can expect to hold for AI$\xi$ or any AI system that claims to
be universally optimal. Consequently, we cannot have a proof if we don't know
what to prove. In SP, the only important property of $\mu$ for proving loss bounds
was its complexity $K(\mu)$. We will see that in the AI case, there are no useful
bounds in terms of $K(\mu)$ only. We either have to study restricted problem classes or
consider bounds depending on other properties of $\mu$, rather than on its complexity
only. In the following, we will exhibit the difficulties by two examples and introduce
concepts that may be useful for proving value bounds. Despite the difficulties in even
claiming useful value bounds, we nevertheless firmly believe that the intelligence order
relation defined earlier correctly formalizes the intuitive meaning of intelligence and,
hence, that the AI$\xi$ agent is universally optimal.

(Pseudo) Passive $\mu$ and the HeavenHell example. In the following we choose
$m_k = m$. We want to compare the true, i.e. $\mu$-expected value $V_{1m}^{\mu}$ of a $\mu$-independent
universal policy $p^{best}$ with any other policy $p$. Naively, we might expect the existence
of a policy $p^{best}$ that maximizes $V_{1m}^{\mu}$, apart from additive corrections of lower order
for $m \to \infty$:

$$V_{1m}^{p^{best}\mu} \;\geq\; V_{1m}^{p\mu} - o(\ldots) \quad \forall \mu, p \tag{25}$$

Such policies are sometimes called self-optimizing [KV86]. Note that $V_{1m}^{p^\mu\mu} \geq V_{1m}^{p\mu}\ \forall p$,
but $p^\mu$ is not a candidate for (a universal) $p^{best}$ as it depends on $\mu$. On the other hand,
the policy $p^\xi$ of the AI$\xi$ agent maximizes $V_{1m}^{\xi}$ by definition ($p^\xi \succeq p$). As $V_{1m}^{\xi}$ is thought
to be a guess of $V_{1m}^{\mu}$, we might expect $p^{best} = p^\xi$ to approximately maximize $V_{1m}^{\mu}$, i.e.
(25) to hold. Let us consider the problem class (set of environments) $\mathcal{M} = \{\mu_0, \mu_1\}$
with $\mathcal{Y} = \mathcal{R} = \{0,1\}$ and $r_k = \delta_{i y_1}$ in environment $\mu_i$, where the Kronecker symbol $\delta_{xy}$
is defined as 1 for $x = y$ and 0 otherwise. The first action $y_1$ decides whether you go
to heaven with all future rewards $r_k$ being 1 (good) or to hell with all future rewards
being 0 (bad). Note that the $\mu_i$ are (deterministic, non-ergodic) mdps:



[Figure: state diagram of $\mu_i$ with the three states Hell, Start, and Heaven.]



It is clear that if $\mu_i$, i.e. $i$, is known, the optimal policy $p^{\mu_i}$ is to output $y_1 = i$ in the
first cycle with $V_{1m}^{p^{\mu_i}\mu_i} = m$. On the other hand, any unbiased policy $p^{best}$ independent
of the actual $\mu$ either outputs $y_1 = 1$ or $y_1 = 0$. Independent of the actual choice
$y_1$, there is always an environment ($\mu = \mu_{1-y_1}$) for which this choice is catastrophic
($V_{1m}^{p^{best}\mu} = 0$). No single agent can perform well in both environments $\mu_0$ and $\mu_1$. The
r.h.s. of (25) equals $m - o(m)$ for $p = p^\mu$. For all $p^{best}$ there is a $\mu$ for which the l.h.s.
is zero. We have shown that no $p^{best}$ can satisfy (25) for all $\mu$ and $p$, so we cannot
expect $p^\xi$ to do so. Nevertheless, there are problem classes for which (25) holds, for
instance SP. For SP, (25) is just a reformulation of the SP loss bound with an appropriate
choice for $p^{best}$, namely the predictor $\Lambda_\xi$ (which differs from $p^\xi$, see next section).
We expect (25) to hold for all inductive problems in which the environment is not
influenced* by the output of the agent. We want to call these $\mu$ passive or inductive
environments. Further, we want to call $\mathcal{M}$ and $\mu \in \mathcal{M}$ satisfying (25) with $p^{best} = p^\xi$
pseudo-passive. So we expect inductive $\mu$ to be pseudo-passive.
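The HeavenHell argument can be checked mechanically. The sketch below is an illustration only: the function names and the choice $m = 100$ are assumptions. It computes the total reward in $\mu_i$ as a function of the first action and shows that every fixed first action yields zero reward in one of the two environments.

```python
def heaven_hell_value(i, first_action, m):
    """Total reward over m cycles in mu_i: every reward is 1 iff y_1 == i, else 0."""
    return m if first_action == i else 0

def worst_case_value(first_action, m):
    """Value of a mu-independent first action in the less favourable environment."""
    return min(heaven_hell_value(i, first_action, m) for i in (0, 1))

m = 100  # hypothetical lifetime
informed = [heaven_hell_value(i, i, m) for i in (0, 1)]       # policy that knows i
uninformed = [worst_case_value(y1, m) for y1 in (0, 1)]       # any fixed first action
```

The informed policy achieves $m$ in both environments, whereas each uninformed choice has worst-case value 0, which is why (25) cannot hold on this class.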

* Of course, the reward feedback $r_k$ depends on the agent's output. What we have in mind is,
like in sequence prediction, that the true sequence is not influenced by the agent.

The OnlyOne example. Let us give a further example to demonstrate the difficulties
in establishing value bounds. Let $\mathcal{X} = \mathcal{R} = \{0,1\}$ and $|\mathcal{Y}|$ be large. We
consider all (deterministic) environments in which a single complex output $y^*$ is
correct ($r = 1$) and all others are wrong ($r = 0$). The problem class $\mathcal{M}$ is defined by

$$\mathcal{M} := \{\mu_{y^*} : y^* \in \mathcal{Y},\ K(y^*) = \lfloor \log|\mathcal{Y}| \rfloor\}, \quad\text{where}\quad \mu_{y^*}(y\!x_{<k}\, y_k\, \underline{1}) := \delta_{y_k y^*}\ \ \forall k.$$

There are $N \approx |\mathcal{Y}|$ such $y^*$. The only way a $\mu$-independent policy $p$ can find the
correct $y^*$ is by trying one $y$ after the other in a certain order. In the first $N-1$
cycles, at most $N-1$ different $y$ are tested. As there are $N$ different possible $y^*$,
there is always a $\mu \in \mathcal{M}$ for which $p$ gives erroneous outputs in the first $N-1$ cycles.
The number of errors is at least $N-1 \approx |\mathcal{Y}| \approx 2^{K(y^*)} \approx 2^{K(\mu)}$ for this $\mu$. As this is true
for any $p$, it is also true for the AI$\xi$ model, hence an error bound of order $2^{K(\mu)}$ is the
best possible error bound we can expect that depends on $K(\mu)$ only. Actually, we will derive
such a bound in Section 5.1 for inductive environments. Unfortunately, as we are mainly
interested in the cycle region $k \ll |\mathcal{Y}| \approx 2^{K(\mu)}$, this bound is vacuous.
There are no interesting bounds for deterministic $\mu$ depending on $K(\mu)$ only, unlike
the SP case. Bounds must either depend on additional properties of $\mu$ or we have to
consider specialized bounds for restricted problem classes. The case of probabilistic
$\mu$ is similar. Whereas for SP there are useful bounds in terms of $L^{\Lambda_\mu}$ and $K(\mu)$,
there are no such bounds for AI$\xi$. Again, this is not a drawback of AI$\xi$ since for no
unbiased AI system could the errors/rewards be bounded in terms of $K(\mu)$ and the
errors/rewards of AI$\mu$ only.
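The counting argument of the OnlyOne example can be illustrated for a tiny output space. In the sketch below, the four candidate outputs and the function names are assumptions made for illustration. A $\mu$-independent policy is modelled as a fixed order in which outputs are tried, and the worst case over all targets is computed for every possible order.

```python
from itertools import permutations

def errors_before_success(order, target):
    """Wrong outputs made by a policy that tries candidates in `order` until it hits target."""
    return order.index(target)

def worst_case_errors(order):
    """Maximum number of errors over all possible correct outputs y*."""
    return max(errors_before_success(order, y) for y in order)

candidates = (0, 1, 2, 3)  # hypothetical output space with N = 4
worst = {order: worst_case_errors(order) for order in permutations(candidates)}
```

Every one of the 24 orders has worst case $N-1 = 3$, matching the claim that no policy can avoid $N-1$ errors in some environment of the class.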

There is a way to make use of gross (e.g. $2^{K(\mu)}$) bounds. Assume that after
a reasonable number of cycles $k$, the information $x_{<k}$ perceived by the AI$\xi$ agent
contains a lot of information about the true environment $\mu$. The information in
$x_{<k}$ might be coded in any form. Let us assume that the complexity $K(\mu|x_{<k})$ of $\mu$
under the condition that $x_{<k}$ is known, is of order 1. Consider a theorem bounding
the sum of rewards or of other quantities over cycles $1\ldots\infty$ in terms of $f(K(\mu))$ for a
function $f$ with $f(O(1)) = O(1)$, like $f(n) = 2^n$. Then, there will be a bound for cycles
$k\ldots\infty$ in terms of $\approx f(K(\mu|x_{<k})) = O(1)$. Hence, a bound like $2^{K(\mu)}$ can be replaced
by the small bound $\approx 2^{K(\mu|x_{<k})} = O(1)$ after $k$ cycles. All one has to show/ensure/assume
is that enough information about $\mu$ is presented (in any form) in the first $k$ cycles.
In this way, even a gross bound could become useful. In Section 5 we use a similar
argument to prove that AI$\xi$ is able to learn supervised.

Asymptotic learnability. In the following, we weaken (25) in the hope of getting
a bound applicable to wider problem classes than the passive one. Consider the
I/O sequence $y_1 x_1 y_2 x_2 \ldots$ caused by AI$\xi$. On history $y\!x_{<k}$, AI$\xi$ will output $y_k \equiv y_k^\xi$
in cycle $k$. Let us compare this to $y_k^\mu$, what AI$\mu$ would output, still on the same
history $y\!x_{<k}$ produced by AI$\xi$. As AI$\mu$ maximizes the $\mu$-expected value, AI$\xi$ causes
lower (or at best equal) $V_{km_k}^{\mu}$ if $y_k^\xi$ differs from $y_k^\mu$. Let
$D_{n\mu\xi} := \mathbf{E}\big[\sum_{k=1}^{n} 1 - \delta_{y_k^\mu, y_k^\xi}\big]$
be the $\mu$-expected number of suboptimal choices of AI$\xi$, i.e. outputs different from
AI$\mu$ in the first $n$ cycles. One might weigh the deviating cases by their severity.
In particular, when the $\mu$-expected rewards $V_{km_k}^{p\mu}$ for $y_k^\xi$ and $y_k^\mu$ are equal or close
to each other, this should be taken into account in a definition of $D_{n\mu\xi}$, e.g. by a
weight factor $[V_{km}^{*\mu}(y\!x_{<k}) - V_{km}^{p^\xi\mu}(y\!x_{<k})]$. These details do not matter in the following
qualitative discussion. The important difference to (25) is that here we stick to the
history produced by AI$\xi$ and count a wrong decision as, at most, one error. The
wrong decision in the HeavenHell example in the first cycle no longer counts as
losing $m$ rewards, but counts as one wrong decision. In a sense, this is fairer. One
shouldn't blame somebody too much who makes a single wrong decision for which
he just has too little information available to make a correct decision. The
AI$\xi$ model would deserve to be called asymptotically optimal if the probability of
making a wrong decision tends to zero, i.e. if

$$D_{n\mu\xi}/n \to 0 \quad\text{for}\quad n \to \infty, \qquad\text{i.e.}\qquad D_{n\mu\xi} = o(n). \tag{26}$$

We say that $\mu$ can be asymptotically learned (by AI$\xi$) if (26) is satisfied. We claim
that AI$\xi$ (for $m_k \to \infty$) can asymptotically learn every problem $\mu$ of relevance, i.e.
AI$\xi$ is asymptotically optimal. We included the qualifier of relevance, as we are
not sure whether there could be strange $\mu$ spoiling (26), but we expect those $\mu$ to
be irrelevant from the perspective of AI. In the field of Learning, there are many
asymptotic learnability theorems, often not too difficult to prove. So a proof of (26)
might also be feasible. Unfortunately, asymptotic learnability theorems are often
too weak to be useful from a practical point of view. Nevertheless, they point in the
right direction.
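To illustrate what the criterion $D_n = o(n)$ in (26) demands, the following sketch counts deviations between two action sequences. It is purely hypothetical: the deviation pattern (the learner deviates from the informed agent only in cycles that are powers of two) and the function names are assumptions, and the count is deterministic rather than an expectation.

```python
def suboptimal_choices(actions_xi, actions_mu):
    """Number of cycles in which the two action sequences differ."""
    return sum(1 for a, b in zip(actions_xi, actions_mu) if a != b)

def is_power_of_two(k):
    return k > 0 and (k & (k - 1)) == 0

def deviation_ratio(n):
    """D_n / n for a hypothetical learner that deviates only at cycles 1, 2, 4, 8, ..."""
    actions_mu = [0] * n
    actions_xi = [1 if is_power_of_two(k) else 0 for k in range(1, n + 1)]
    return suboptimal_choices(actions_xi, actions_mu) / n
```

Here the number of deviations grows only logarithmically, so the ratio shrinks from 5/16 at $n = 16$ to 11/1024 at $n = 1024$; such a learner would satisfy (26) even though it never stops making occasional mistakes.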

Uniform $\mu$. From the convergence (23) of $\xi \to \mu$ we might expect $V_{km_k}^{p\xi} \to V_{km_k}^{p\mu}$ for
all $p$, and hence we might also expect $y_k^\xi$ defined in (22) to converge to $y_k^\mu$ defined
in (11) for $k \to \infty$. The first problem is that if the $V_{km_k}$ for the different choices of
$y_k$ are nearly equal, then even if $V_{km_k}^{p\xi} \approx V_{km_k}^{p\mu}$, $y_k^\xi \neq y_k^\mu$ is possible due to the
non-continuity of $\arg\max_{y_k}$. This can be cured by a weighted $D_{n\mu\xi}$ as described above.
More serious is the second problem, which we explain for $h_k = 1$ and $\mathcal{X} = \mathcal{R} = \{0,1\}$. For
$y_k^\xi \equiv \arg\max_{y_k} \xi(y\!x_{<k}\, y_k \underline{1})$ to converge to $y_k^\mu \equiv \arg\max_{y_k} \mu(y\!x_{<k}\, y_k \underline{1})$, it is not
sufficient to know that $\xi(y\!x_{<k}\, y\underline{x}_k) \to \mu(y\!x_{<k}\, y\underline{x}_k)$ as proven in (23). We need convergence
not only for the true output $y_k$, but also for alternative outputs $y'_k$. $y_k^\xi$ converges to $y_k^\mu$
if $\xi$ converges uniformly to $\mu$, i.e. if in addition to (23)

$$\big|\mu(y\!x_{<k}\, y'_k \underline{x}'_k) - \xi(y\!x_{<k}\, y'_k \underline{x}'_k)\big| \;<\; c \cdot \big|\mu(y\!x_{<k}\, y\underline{x}_k) - \xi(y\!x_{<k}\, y\underline{x}_k)\big| \quad \forall y'_k x'_k \tag{27}$$

holds for some constant $c$ (at least in a $\mu$-expected sense). We call $\mu$ satisfying (27)
uniform. For uniform $\mu$ one can show (26) with appropriately weighted $D_{n\mu\xi}$ and
bounded horizon $h_k < h_{max}$. Unfortunately there are relevant $\mu$ that are not uniform.

Other concepts. In the following, we briefly mention some further concepts.
A Markovian $\mu$ is defined as depending only on the last cycle, i.e. $\mu(y\!x_{<k}\, y\underline{x}_k) =
\mu_k(x_{k-1}\, y\underline{x}_k)$. We say $\mu$ is generalized ($l^{th}$-order) Markovian if $\mu(y\!x_{<k}\, y\underline{x}_k) =
\mu_k(x_{k-l}\, y\!x_{k-l+1:k-1}\, y\underline{x}_k)$ for fixed $l$. This property has some similarities to
factorizable $\mu$ defined earlier. If further $\mu_k \equiv \mu_1\ \forall k$, $\mu$ is called stationary. Further, we call
$\mu$ ($\xi$) forgetful if $\mu(y\!x_{<k}\, y\underline{x}_k)$ ($\xi(y\!x_{<k}\, y\underline{x}_k)$) become(s) independent of $y\!x_{<l}$ for fixed $l$
and $k \to \infty$ with $\mu$-probability 1. Further, we say $\mu$ is farsighted if $\lim_{m_k\to\infty} y_k^{(m_k)}$
exists. More details will be given in Section 4.5, where we also give an example of
a farsighted $\mu$ for which nevertheless the limit $m_k \to \infty$ makes no sense.

Summary. We have introduced several concepts that might be useful for proving
value bounds, including forgetful, relevant, asymptotically learnable, farsighted,
uniform, (generalized) Markovian, factorizable and (pseudo) passive $\mu$. We have sorted
them here, approximately in the order of decreasing generality. We will call them
separability concepts. The more general (like relevant, asymptotically learnable and
farsighted) $\mu$ will be called weakly separable, the more restrictive (like (pseudo)
passive and factorizable) $\mu$ will be called strongly separable, but we will use these
qualifiers in a more qualitative, rather than rigid sense. Other (non-separability)
concepts are deterministic $\mu$ and, of course, the class of all chronological $\mu$.

4.4 Pareto Optimality of AI$\xi$

This subsection shows Pareto-optimality of AI$\xi$ analogous to SP. The total
$\mu$-expected reward $V_\mu^{p^\xi}$ of policy $p^\xi$ of the AI$\xi$ model is of central interest in judging the
performance of AI$\xi$. We know that there are policies (e.g. $p^\mu$ of AI$\mu$) with higher
$\mu$-value ($V_\mu^{p^\mu} \geq V_\mu^{p^\xi}$). In general, every policy based on an estimate $\rho$ of $\mu$ that is
closer to $\mu$ than $\xi$ is, outperforms $p^\xi$ in environment $\mu$, simply because it is more
tailored toward $\mu$. On the other hand, such a system probably performs worse than
$p^\xi$ in other environments. Since we do not know $\mu$ in advance we may ask whether
there exists a policy $p$ with better or equal performance than $p^\xi$ in all environments
$\nu \in \mathcal{M}$ and a strictly better performance for one $\nu \in \mathcal{M}$. This would clearly render
$p^\xi$ suboptimal. One can show that there is no such $p$ [Hut02b].

Definition 11 (Pareto Optimality) A policy $\tilde p$ is called Pareto optimal if there
is no other policy $p$ with $V_\nu^{p} \geq V_\nu^{\tilde p}$ for all $\nu \in \mathcal{M}$ and strict inequality for at least one
$\nu$.

Theorem 12 (Pareto Optimality) AI$\xi$, alias $p^\xi$, is Pareto optimal.

Pareto optimality should be regarded as a necessary condition for an agent aiming
to be optimal. From a practical point of view, a significant increase of $V$ for many
environments $\nu$ may be desirable, even if this causes a small decrease of $V$ for a few
other $\nu$. The impossibility of such a "balanced" improvement is a more demanding
condition on $p^\xi$ than pure Pareto optimality. In [Hut02b] it has been shown that
AI$\xi$ is also balanced Pareto optimal.
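Definition 11 can be checked directly on a finite table of values. The sketch below is a hypothetical illustration: the three policies, their values in two environments, the prior weights, and the function names are all assumptions. It tests Pareto dominance and confirms on this table that a policy maximizing the weighted mixture value is Pareto optimal.

```python
def dominates(values_p, values_q):
    """True if p is at least as good as q everywhere and strictly better somewhere."""
    pairs = list(zip(values_p, values_q))
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)

def pareto_optimal(name, table):
    """True if no other policy in the table dominates the named policy."""
    return not any(dominates(v, table[name]) for other, v in table.items() if other != name)

# Hypothetical values V_nu^p for two environments nu (assumption, not from the text).
table = {"p1": (3, 5), "p2": (5, 3), "p3": (2, 5)}
weights = (0.5, 0.5)  # hypothetical positive prior weights w_nu

def mixture_value(name):
    return sum(w * v for w, v in zip(weights, table[name]))

best = max(table, key=mixture_value)
```

Policy `p3` is dominated by `p1`, while the mixture-optimal policy is not dominated by anything. The general reason is simple: with all weights positive, a policy dominating the mixture maximizer would have a strictly larger mixture value, a contradiction.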

4.5 The Choice of the Horizon 

The only significant arbitrariness in the AI$\xi$ model lies in the choice of the horizon
function $h_k = m_k - k + 1$. We discuss some choices that seem to be natural and give
preliminary conclusions at the end. We will not discuss ad hoc choices of the horizon for
specific problems (like the discussion in Section 5 in the context of finite strategic
games). We are interested in universal choices of $m_k$.

Fixed horizon. If the lifetime of the agent is known to be $m$, which is in practice
always large but finite, then the choice $m_k = m$ maximizes correctly the expected
future reward. Lifetime $m$ is usually not known in advance, as in many cases the
time we are willing to run an agent depends on the quality of its outputs. For this
reason, it is often desirable that good outputs are not delayed too much, if this
results in a marginal reward increase only. This can be incorporated by damping
the future rewards. If, for instance, the probability of survival in a cycle is $\gamma < 1$,
an exponential damping (geometric discount) $r_k := r'_k \cdot \gamma^k$ is appropriate, where the $r'_k$
are bounded, e.g. $r'_k \in [0,1]$. Expression (22) converges for $m_k \to \infty$ in this case.*
But this does not solve the problem, as we introduced a new arbitrary time scale
$(1-\gamma)^{-1}$. Every damping introduces a time scale. Taking $\gamma \to 1$ is prone to the same
problems as $m_k \to \infty$ in the undiscounted case discussed below.

* More precisely, $y_k = \arg\max_{y_k} \lim_{m_k\to\infty} V_{km_k}^{*\xi}(y\!x_{<k}\, y_k)$ exists.
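The time scale introduced by a geometric discount can be made explicit numerically. This is a minimal sketch under assumed values: $\gamma = 0.99$, constant raw rewards $r'_k = 1$, and the function names are illustrative and not from the source.

```python
def discounted_sum(raw_rewards, gamma):
    """Sum of r'_k * gamma^k for k = 1, 2, ... over the given raw rewards."""
    return sum(r * gamma ** k for k, r in enumerate(raw_rewards, start=1))

def time_scale(gamma):
    """The time scale (1 - gamma)^(-1) introduced by geometric discounting."""
    return 1.0 / (1.0 - gamma)

gamma = 0.99                                   # hypothetical survival probability
total = discounted_sum([1.0] * 5000, gamma)    # close to gamma / (1 - gamma) = 99
```

With $\gamma = 0.99$ the discounted sum saturates near 99 and the time scale is about 100 cycles: rewards much further in the future contribute almost nothing, whatever the actual lifetime of the agent.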

Dynamic horizon (universal & harmonic discounting). The largest horizon
with guaranteed finite and enumerable reward sum can be obtained by the universal
discount $r_k \leadsto r_k \cdot 2^{-K(k)}$. This discount results in a truly farsighted agent with an effective
horizon that grows faster than any computable function. It is similar to a near-harmonic
discount $r_k \leadsto r_k \cdot k^{-(1+\varepsilon)}$, since $2^{-K(k)} \leq 1/k$ for most $k$ and $2^{-K(k)} \geq
c/(k \log^2 k)$. More generally, the time-scale invariant damping factor $r_k = r'_k \cdot k^{-\alpha}$
introduces a dynamic time scale. In cycle $k$ the contribution of cycle $2^{1/\alpha} \cdot k$ is
damped by a factor $\frac{1}{2}$. The effective horizon in this case is of order $k$. The choice $h_k = \beta \cdot k$
with $\beta \sim 2^{1/\alpha}$ qualitatively models the same behavior. We have not introduced an
arbitrary time scale $m$, but limited the farsightedness to some multiple (or fraction)
of the length of the current history. This avoids the preselection of a global time
scale $m$ or $(1-\gamma)^{-1}$. This choice has some appeal, as it seems that humans of age
$k$ years usually do not plan their lives for more than, perhaps, the next $k$ years
($\beta_{human} \approx 1$). From a practical point of view this model might serve all needs, but
from a theoretical point of view we feel uncomfortable with such a limitation in the horizon
from the very beginning. Note that we have to choose $\beta = O(1)$ because otherwise we
would again introduce a number $\beta$, which has to be justified. We favor the universal
discount $2^{-K(k)}$, since it allows us, if desired, to "mimic" all other more greedy
behaviors based on any other discount $\gamma_k$ by choosing $r_k \in [0, c\cdot\gamma_k] \subset [0, 2^{-K(k)}]$.
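The claim that the power discount $k^{-\alpha}$ halves the contribution at cycle $2^{1/\alpha} \cdot k$ follows from $(2^{1/\alpha} k)^{-\alpha} / k^{-\alpha} = 1/2$. The sketch below checks this numerically and contrasts it with a geometric discount, whose half-damping distance does not depend on $k$. Function names and the sample values are assumptions for illustration.

```python
import math

def power_discount(k, alpha):
    """Time-scale invariant damping factor k^(-alpha)."""
    return k ** (-alpha)

def half_damping_cycle(k, alpha):
    """Cycle at which the power discount has dropped to half its value at cycle k."""
    return 2 ** (1 / alpha) * k

def geometric_half_damping_distance(gamma):
    """Number of cycles after which gamma^k has halved; independent of k."""
    return math.log(2) / math.log(1 / gamma)

ratios = [
    power_discount(half_damping_cycle(k, a), a) / power_discount(k, a)
    for k in (10, 1000) for a in (0.5, 1.0, 2.0)
]
```

For the power discount the look-ahead distance $2^{1/\alpha} k - k$ grows linearly with the age $k$ of the agent, which is the dynamic horizon described above; for the geometric discount it stays fixed.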

Infinite horizon. The naive limit $m_k \to \infty$ in (22) may turn out to be well defined
and the previous discussion superfluous. In the following, we suggest a limit
that is always well defined (for finite $\mathcal{Y}$). Let $y_k^{(m_k)}$ be defined as in (22) with
dependence on $m_k$ made explicit. Further, let $\mathcal{Y}_k^{(m)} := \{ y_k^{(m_k)} : m_k \geq m \}$ be the set of
outputs in cycle $k$ for the choices $m_k = m, m+1, m+2, \ldots$. Because $\mathcal{Y}_k^{(m)} \supseteq \mathcal{Y}_k^{(m+1)} \neq \{\}$,
we have $\mathcal{Y}_k^{(\infty)} := \bigcap_{m=k}^{\infty} \mathcal{Y}_k^{(m)} \neq \{\}$. We define the $m_k = \infty$ model to output any
$y_k^{(\infty)} \in \mathcal{Y}_k^{(\infty)}$. This is the best output consistent with some arbitrarily large choice
of $m_k$. Choosing the lexicographically smallest $y_k^{(\infty)} \in \mathcal{Y}_k^{(\infty)}$ would correspond to the
lower limit $\underline{\lim}_{m\to\infty}\, y_k^{(m)}$, which always exists (for finite $\mathcal{Y}$). Generally $y_k^{(\infty)} \in \mathcal{Y}_k^{(\infty)}$
is unique, i.e. $|\mathcal{Y}_k^{(\infty)}| = 1$, iff the naive limit $\lim_{m\to\infty} y_k^{(m)}$ exists. Note that the limit
$\lim_{m\to\infty} V_{km}^{*}(y\!x_{<k})$ need not exist for this construction.

Average reward and differential gain. Taking the raw average reward
$(r_k + \ldots + r_m)/(m-k+1)$ and $m \to \infty$ also does not help: consider an arbitrary policy for the
first $k$ cycles and the/an optimal policy for the remaining cycles $k+1\ldots\infty$. In e.g.
i.i.d. environments the limit exists, but all these policies give the same average value,
since changing a finite number of terms does not affect an infinite average. In mdp
environments with a single recurrent class one can define the relative or differential
gain [BT96]. In more general environments (we are interested in) the differential
gain can be infinite, which is acceptable, since differential gains can still be totally
ordered. The major problem is the existence of the differential gain, i.e. whether it
converges for $m \to \infty$ in $\mathbb{R} \cup \{\infty\}$ at all (and does not oscillate). This is just the old
convergence problem in slightly different form.

Immortal agents are lazy. The construction in the next to previous paragraph
leads to a mathematically elegant, no-parameter AI$\xi$ model. Unfortunately this is
not the end of the story. The limit $m_k \to \infty$ can cause undesirable results in the
AI$\mu$ model for special $\mu$, which might also happen in the AI$\xi$ model however we
define $m_k \to \infty$. Consider an agent who for every $\sqrt{l}$ consecutive days of work can
thereafter take $l$ days of holiday. Formally, consider $\mathcal{Y} = \mathcal{X} = \mathcal{R} = \{0,1\}$. Output $y_k = 0$
shall give reward $r_k = 0$, and output $y_k = 1$ shall give $r_k = 1$ iff $y_{k-l-\sqrt{l}} \ldots y_{k-l} = 0\ldots0$
for some $l$, i.e. the agent can achieve $l$ consecutive positive rewards if there was a
preceding sequence of length at least $\sqrt{l}$ with $y_k = r_k = 0$. If the lifetime of the AI$\mu$
agent is $m$, it outputs $y_k = 0$ in the first $s$ cycles and then $y_k = 1$ for the remaining $s^2$
cycles with $s$ such that $s + s^2 = m$. This will lead to the highest possible total
reward $V_{1m} = s^2 = m + \frac{1}{2} - \sqrt{m + \frac{1}{4}}$. Any fragmentation of the 0 and 1 sequences would
reduce $V_{1m}$, e.g. alternatingly working for 2 days and taking 4 days off would give
$V_{1m} = \frac{2}{3}m$. For $m \to \infty$ the AI$\mu$ agent can and will delay the point $s$ of switching
to $y_k = 1$ indefinitely and always output 0, leading to total reward 0, obviously the
worst possible behavior. The AI$\xi$ agent will explore the above rule after a while
of trying $y_k = 0/1$ and then applies the same behavior as the AI$\mu$ agent, since the
simplest rules covering past data dominate $\xi$. For finite $m$ this is exactly what we
want, but for infinite $m$ the AI$\xi$ model (probably) fails, just as the AI$\mu$ model does.
The good point is that this is not a weakness of the AI$\xi$ model in particular, as AI$\mu$
fails too. The bad point is that $m_k \to \infty$ has far-reaching consequences, even when
starting from an already very large $m_k = m$. This is because the $\mu$ of this example is
highly nonlocal in time, i.e. it may violate one of our weak separability conditions.
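The arithmetic of the work/holiday example can be reproduced with a simplified model. The sketch below is an assumption-laden illustration, not the environment $\mu$ itself: it only considers block strategies in which $w$ consecutive work days earn $w^2$ holiday days, and the function names are hypothetical.

```python
from math import isqrt, sqrt

def single_block_reward(m):
    """Work s days once, then take s*s holidays, with the largest s such that s + s*s <= m."""
    s = (isqrt(4 * m + 1) - 1) // 2
    return s * s

def alternating_reward(m, work=2):
    """Repeat: work `work` days, then take work**2 holidays, until lifetime m ends."""
    period = work + work * work
    full_periods, rest = divmod(m, period)
    return full_periods * work * work + max(0, rest - work)

m = 9900  # hypothetical lifetime with s = 99, since 99 + 99**2 = 9900
closed_form = m + 0.5 - sqrt(m + 0.25)
```

For $m = 9900$ the single long work phase yields 9801 reward days, in agreement with $m + \frac{1}{2} - \sqrt{m + \frac{1}{4}}$, whereas the 2-on/4-off schedule yields only 6600, i.e. $\frac{2}{3}m$. The longer the lifetime, the later the optimal switch, which is the mechanism behind the laziness of the immortal agent.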

Conclusions. We are not sure whether the choice of $m_k$ is of marginal importance,
as long as $m_k$ is chosen sufficiently large and of low complexity, $m_k = 2^{2^{16}}$ for instance,
or whether the choice of $m_k$ will turn out to be a central topic for the AI$\xi$ model
or for the planning aspect of any AI system in general. We suppose that the limit
$m_k \to \infty$ for the AI$\xi$ model results in correct behavior for weakly separable $\mu$. A
proof of this conjecture, if true, would probably give interesting insights.

4.6 Outlook 

Expert advice approach. We considered expected performance bounds for
predictions based on Solomonoff's prior. The other, dual, currently very popular
approach is "prediction with expert advice" (PEA), invented by Littlestone and
Warmuth (1989), and Vovk (1992). Whereas PEA performs well in any environment,
but only relative to a given set of experts, our $\Lambda_\xi$ predictor competes with any
other predictor, but only in expectation for environments with computable
distribution. It seems philosophically less compromising to make assumptions on prediction
strategies than on the environment, however weak. One could investigate whether
PEA can be generalized to the case of active agents, which would result in a model
dual to AIXI. We believe the answer to be negative, which on the positive side would
show the necessity of Occam's razor assumption, and the distinguishedness of AIXI.

Actions as random variables. The uniqueness for the choice of the generalized
$\xi$ (16) in the AIXI model could be explored. From the originally many alternatives,
which could all be ruled out, there is one alternative which still seems possible.
Instead of defining $\xi$ in that way, one could treat the agent's actions $y$ also as universally
distributed random variables and then conditionalize $\xi$ on $y$ by the chain rule.

Structure of AIXI. The algebraic properties and the structure of AIXI could be 
investigated in more depth. This would extract the essentials from AIXI which 
finally could lead to an axiomatic characterization of AIXI. The benefit is as in 
any axiomatic approach. It would clearly exhibit the assumptions, separate the 
essentials from technicalities, simplify understanding and, most important, guide in 
finding proofs. 

Restricted policy classes. The development in this section could be scaled down
to restricted classes of policies $\mathcal{P}$. One may define the policy $\arg\max_{p\in\mathcal{P}} V_\xi^{p}$. For instance,
consider a finite class of quickly computable policies. For mdps, $\xi$ is quickly
computable and $V_\xi^{p}$ can be (efficiently) computed by Monte Carlo sampling. Maximizing
over the finitely many policies $p \in \mathcal{P}$ selects the asymptotically best policy from
$\mathcal{P}$ for all (ergodic) mdps [Hut02b].
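The idea of estimating the value of each policy in a finite class by Monte Carlo sampling and then maximizing can be sketched as follows. Everything concrete here is an assumption made for illustration: the two-armed Bernoulli bandit environments, their mixture weights, the three open-loop policies (which ignore observations), the horizon, and the sample size.

```python
import random

# Hypothetical mixture: two bandit environments (arm success probabilities) with weights.
ARM_PROBS = [(0.9, 0.2), (0.8, 0.3)]
ENV_WEIGHTS = [0.5, 0.5]

# Hypothetical finite policy class; each policy maps the cycle index to an arm.
POLICIES = {
    "always_0": lambda k: 0,
    "always_1": lambda k: 1,
    "alternate": lambda k: k % 2,
}

def estimate_value(policy, horizon, episodes, rng):
    """Monte Carlo estimate of the mixture-expected total reward of a policy."""
    total = 0
    for _ in range(episodes):
        arms = rng.choices(ARM_PROBS, weights=ENV_WEIGHTS)[0]  # sample an environment
        for k in range(horizon):
            if rng.random() < arms[policy(k)]:
                total += 1
    return total / episodes

rng = random.Random(0)  # fixed seed so the sketch is reproducible
values = {name: estimate_value(p, horizon=10, episodes=2000, rng=rng)
          for name, p in POLICIES.items()}
best = max(values, key=values.get)
```

In this toy setting the exact mixture values are 8.5, 2.5, and 5.5, so the estimates separate clearly and the maximization picks `always_0`. A realistic policy class would of course contain policies that react to the observed rewards.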

4.7 Conclusions 

All tasks that require intelligence to be solved can naturally be formulated as a
maximization of some expected utility in the framework of agents. We gave an
explicit expression (11) of such a decision-theoretic agent. The main remaining
problem is the unknown prior probability distribution $\mu^{AI}$ of the environment(s).
Conventional learning algorithms are unsuitable, because they can neither handle
large (unstructured) state spaces nor do they converge in the theoretically minimal
number of cycles nor can they handle non-stationary environments appropriately.
On the other hand, the universal semimeasure $\xi$, based on ideas from algorithmic
information theory, solves the problem of the unknown prior distribution
for induction problems. No explicit learning procedure is necessary, as $\xi$
automatically converges to $\mu$. We unified the theory of universal sequence prediction with
the decision-theoretic agent by replacing the unknown true prior $\mu^{AI}$ by an
appropriately generalized universal semimeasure $\xi^{AI}$. We gave strong arguments that
the resulting AI$\xi$ model is universally optimal. Furthermore, possible solutions to
the horizon problem were discussed. In Section 5 we present a number of problem
classes, and outline how the AI$\xi$ model can solve them. They include sequence
prediction, strategic games, function minimization and, especially, how AI$\xi$ learns to
learn supervised. Later we develop a modified time-bounded (computable)
AIXItl version.

5 Important Problem Classes 

In order to give further support for the universality and optimality of the AI$\xi$ theory,
we apply AI$\xi$ in this section to a number of problem classes. They include sequence
prediction, strategic games, function minimization and, especially, how AI$\xi$ learns
to learn supervised. For some classes we give concrete examples to illuminate the
scope of the problem class. We first formulate each problem class in its natural way
(when $\mu^{problem}$ is known) and then construct a formulation within the AI$\mu$ model and
prove its equivalence. We then consider the consequences of replacing $\mu$ by $\xi$. The
main goal is to understand why and how the problems are solved by AI$\xi$. We only
highlight special aspects of each problem class. Sections 5.1-5.5 together should give
a better picture of the AI$\xi$ model. We do not study every aspect for every problem
class. The subsections may be read selectively, and are not essential to understand
the remainder.

5.1 Sequence Prediction (SP) 

We introduced the AI$\xi$ model as a unification of ideas of sequential decision theory
and universal probability distribution. We might expect AI$\xi$ to behave identically
to SP$\xi$ when faced with a sequence prediction problem, but things are not that
simple, as we will see.

Using the AI$\mu$ model for sequence prediction. We saw in Section 3 how to
predict sequences for known and unknown prior distribution $\mu^{SP}$. Here we consider
binary sequences* $z_1 z_2 z_3 \ldots \in \mathbb{B}^\infty$ with known prior probability $\mu^{SP}(\underline{z_1 z_2 z_3 \ldots})$.

* We use $z_k$ to avoid notational conflicts with the agent's inputs $x_k$.

We want to show how the AI$\mu$ model can be used for sequence prediction. We
will see that it makes the same prediction as the SP$\mu$ agent. For simplicity we only
discuss the special error loss $\ell_{xy} = 1 - \delta_{xy}$, where $\delta$ is the Kronecker symbol, defined
as $\delta_{ab} = 1$ for $a = b$ and 0 otherwise. First, we have to specify how the AI$\mu$ model
should be used for sequence prediction. The following choice is natural:

The system's output $y_k$ is interpreted as a prediction for the $k^{th}$ bit $z_k$ of the
string under consideration. This means that $y_k$ is binary ($y_k \in \mathbb{B} =: \mathcal{Y}$). As a reaction
of the environment, the agent receives reward $r_k = 1$ if the prediction was correct
($y_k = z_k$), or $r_k = 0$ if the prediction was erroneous ($y_k \neq z_k$). The question is what
the observation $o_k$ in the next cycle should be. One choice would be to inform the
agent about the correct $k^{th}$ bit of the string and set $o_k = z_k$. But as the true bit
$z_k = \delta_{y_k r_k}$ can be inferred from the reward $r_k$ in conjunction with the prediction $y_k$, this
information is redundant. There is no need for this additional feedback. So we set
$o_k = \epsilon \in \mathcal{O} = \{\epsilon\}$, thus having $x_k \equiv r_k \in \mathcal{R} \equiv \mathcal{X} = \{0,1\}$. The agent's performance does
not change when we include this redundant information; it merely complicates the
notation. The prior probability $\mu^{AI}$ of the AI$\mu$ model is

$$\mu^{AI}(y_1\underline{x}_1 \ldots y_k\underline{x}_k) \;=\; \mu^{AI}(y_1\underline{r}_1 \ldots y_k\underline{r}_k) \;=\; \mu^{SP}(\underline{\delta_{y_1 r_1} \ldots \delta_{y_k r_k}}) \;=\; \mu^{SP}(\underline{z_1 \ldots z_k}) \tag{28}$$

In the following, we will drop the superscripts of $\mu$ because they are clear from the
arguments of $\mu$ and the $\mu$ are equal in any case. It is intuitively clear and can formally
be shown [Hut00, Hut04] that maximizing the future reward $V_{km}^{\mu}$ is identical to
greedily maximizing the immediate expected reward $V_{kk}^{\mu}$. There is no
exploration-exploitation tradeoff in the prediction case. Hence, AI$\mu$ acts with

$$\dot y_k \;=\; \arg\max_{y_k} V_{kk}^{*\mu}(\dot y\dot x_{<k}\, y_k) \;=\; \arg\max_{y_k} \sum_{r_k} r_k \cdot \mu^{AI}(\dot y\dot r_{<k}\, y_k \underline{r}_k) \;=\; \arg\max_{z_k} \mu^{SP}(\dot z_1 \ldots \dot z_{k-1}\, \underline{z}_k) \tag{29}$$

The first equation is the definition of the agent's action (10) with $m_k$ replaced by
$k$. In the second equation we used the definition (9) of $V_{km}$. In the last equation we
used (28) and $r_k = \delta_{y_k z_k}$.

So, the AI$\mu$ model predicts that $z_k$ that has maximal $\mu$-probability, given
$\dot z_1 \ldots \dot z_{k-1}$. This prediction is independent of the choice of $m_k$. It is exactly the
prediction scheme of the sequence predictor SP$\mu$ with known prior described in
Section 3 (with special error loss). As this model was optimal, AI$\mu$ is optimal too, i.e. it
has minimal number of expected errors (maximal $\mu$-expected reward) as compared
to any other sequence prediction scheme. From this, it is clear that the value $V_{1m}^{*\mu}$
must be closely related to the expected number of errors $E_m$ that SP$\mu$ makes in the
first $m$ predictions. Indeed one can show that
$V_{1m}^{*\mu} = m - E_m$, and similarly for general loss functions.
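The two ingredients of this reduction, the encoding $r_k = \delta_{y_k z_k}$ with its inverse $z_k = \delta_{y_k r_k}$ and the equivalence of greedy reward maximization with maximum-probability prediction, can be checked in a few lines. The sketch assumes a binary alphabet and represents $\mu$ only through the conditional probability that the next bit is 1; all function names are illustrative assumptions.

```python
def kronecker(a, b):
    """Kronecker symbol: 1 if a == b, else 0."""
    return 1 if a == b else 0

def recover_bit(prediction, reward):
    """z_k = delta_{y_k r_k}: the true bit follows from prediction and reward."""
    return kronecker(prediction, reward)

def expected_reward(prediction, prob_one):
    """Sum over r of r * mu(r | prediction), where prob_one = mu(z_k = 1 | history)."""
    return prob_one if prediction == 1 else 1 - prob_one

def ai_mu_prediction(prob_one):
    """Greedy action of the agent: maximize the immediate expected reward."""
    return max((0, 1), key=lambda y: expected_reward(y, prob_one))

def sp_mu_prediction(prob_one):
    """Sequence predictor: output the bit with maximal mu-probability."""
    return max((0, 1), key=lambda z: prob_one if z == 1 else 1 - prob_one)
```

Because the expected reward of predicting $y$ is exactly the probability that the next bit equals $y$, both predictors coincide, which is the content of (29) in this binary special case.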

Using the AI$\xi$ model for sequence prediction. Now we want to use the universal
AI$\xi$ model instead of AI$\mu$ for sequence prediction and try to derive error/loss
bounds analogous to the SP loss bound. Like in the AI$\mu$ case, the agent's output $y_k$ in cycle $k$
is interpreted as a prediction for the $k^{th}$ bit $z_k$ of the string under consideration.
The reward is $r_k = \delta_{y_k z_k}$, and there are no other inputs ($o_k = \epsilon$). What makes the
analysis more difficult is that $\xi$ is not symmetric in $y_i r_i \leftrightarrow (1-y_i)(1-r_i)$ and (28)
does not hold for $\xi$. On the other hand, $\xi^{AI}$ converges to $\mu^{AI}$ in the limit (23), and
(28) should hold asymptotically for $\xi$ in some sense. So we expect that everything
proven for AI$\mu$ holds approximately for AI$\xi$. The AI$\xi$ model should behave similarly
to Solomonoff prediction SP$\xi$. In particular, we expect error bounds similar to (19).
Making this rigorous seems difficult. Some general remarks have been made in the
last section. Note that bounds like (25) cannot hold in general, but could be valid
for AI$\xi$ in (pseudo) passive environments.

Here we concentrate on the special case of a deterministic computable environment, i.e. the environment is a sequence $\dot z = \dot z_1\dot z_2...$ with $K(\dot z_{1:\infty}) < \infty$. Furthermore, we only consider the simplest horizon model $m_k = k$, i.e. greedily maximize only the next reward. This is sufficient for sequence prediction, as the reward of cycle $k$ only depends on output $y_k$ and not on earlier decisions. This choice is in no way sufficient and satisfactory for the full AI$\xi$ model, as one single choice of $m_k$ should serve for all AI problem classes. So AI$\xi$ should allow good sequence prediction for some universal choice of $m_k$ and not only for $m_k = k$, which definitely does not suffice for more complicated AI problems. The analysis of this general case is a challenge for the future. For $m_k = k$ the AI$\xi$ model (22) with $o_i = \epsilon$ and $r_k \in \{0,1\}$ reduces to

$$\dot y_k \;=\; \arg\max_{y_k}\sum_{r_k} r_k\cdot\xi(\dot y\dot r_{<k}y_k\underline{r_k}) \;=\; \arg\max_{y_k}\xi(\dot y\dot r_{<k}y_k\underline{1}) \qquad (30)$$

The environmental response $\dot r_k$ is given by $\delta_{\dot y_k\dot z_k}$; it is 1 for a correct prediction ($\dot y_k = \dot z_k$) and 0 otherwise. One can show [Hut00, Hut04] that the number of wrong predictions $E^{AI\xi}$ of the AI$\xi$ model (30) in these environments is bounded by

$$E^{AI\xi} \;\stackrel{\times}{<}\; 2^{K(\dot z_{1:\infty})} \;<\; \infty \qquad (31)$$

for a computable deterministic environment string $\dot z_1\dot z_2...$. The intuitive interpretation is that each wrong prediction eliminates at least one program $p$ of size $\ell(p) \stackrel{+}{<} K(\dot z)$. The size is smaller than $K(\dot z)$, as larger policies could not mislead the agent to a wrong prediction, since there is a program of size $K(\dot z)$ making a correct prediction. There are at most $2^{K(\dot z)+O(1)}$ such policies, which bounds the total number of errors.

We have derived a finite bound for $E^{AI\xi}$, but unfortunately, a rather weak one as compared to (19). The reason for the strong bound in the SP case was that every error eliminates half of the programs.
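The halving argument for the SP case can be made concrete with a finite stand-in for the set of programs. The sketch below is an illustration under stated assumptions, not the universal predictor: the hypothesis class (all periodic bit patterns of period 1 to 4), the weights $2^{-2\cdot\text{period}}$, and the name `count_errors` are all chosen here for the example. Each wrong prediction removes at least half of the surviving weight, so the number of errors is at most $\log_2$ of the initial total weight divided by the weight of the true pattern.

```python
import math
from itertools import product

def count_errors(patterns, weights, truth, steps):
    """Predict each bit by weighted majority over the patterns that are still
    consistent with the past; return the number of wrong predictions."""
    alive = list(range(len(patterns)))
    errors = 0
    for k in range(steps):
        mass = [0.0, 0.0]
        for i in alive:
            mass[patterns[i][k % len(patterns[i])]] += weights[i]
        guess = 0 if mass[0] >= mass[1] else 1
        bit = truth[k % len(truth)]
        errors += int(guess != bit)
        # Discard every hypothesis that predicted the wrong bit.
        alive = [i for i in alive if patterns[i][k % len(patterns[i])] == bit]
    return errors

# Assumed toy hypothesis class: periodic patterns, shorter ones weigh more.
patterns = [p for n in range(1, 5) for p in product((0, 1), repeat=n)]
weights = [2.0 ** (-2 * len(p)) for p in patterns]
truth = (0, 1, 1)

errors = count_errors(patterns, weights, truth, steps=50)
bound = math.log2(sum(weights) / 2.0 ** (-2 * len(truth)))
print(errors, bound)
```

The bound is logarithmic in the inverse weight of the true hypothesis, which is the analogue of an error bound proportional to $K(\dot z)$ rather than $2^{K(\dot z)}$.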

The AI$\xi$ model would not be sufficient for realistic applications if the bound (31) were sharp, but we have the strong feeling (but only weak arguments) that better bounds proportional to $K(\dot z)$ analogous to (19) exist. The current proof technique is not strong enough for achieving this. One argument for a better bound is the formal similarity between $\arg\max_{z_k}\xi(\dot z_{<k}\underline{z_k})$ and (30); the other is that we were unable to construct an example sequence for which AI$\xi$ makes more than $O(K(\dot z))$ errors.

5.2 Strategic Games (SG) 

Introduction. A very important class of problems are strategic games (SG). Game theory considers simple games of chance like roulette, games combining chance with strategy like backgammon, up to purely strategic games like chess, checkers, or go. In fact, what is subsumed under game theory is so general that it includes not only a huge variety of game types, but can also describe political and economic competitions and coalitions, Darwinism and many more topics. It seems that nearly every AI problem could be brought into the form of a game. Nevertheless, the intention of a game is that several players perform actions with (partially) observable consequences. The goal of each player is to maximize some utility function (e.g. to win the game). The players are assumed to be rational, taking into account all information they possess. The different goals of the players are usually in conflict. For an introduction to game theory, see [FT91, OR94, RN03, NM44].






If we interpret the AI system as one player and the environment as modeling the other rational player and providing the reinforcement feedback $r_k$, we see that the agent-environment configuration satisfies all criteria of a game. On the other hand, the AI models can handle more general situations, since they interact optimally with an environment, even if the environment is not a rational player with conflicting goals.

Strictly competitive strategic games. In the following, we restrict ourselves to deterministic, strictly competitive strategic(*) games with alternating moves. Player 1 makes move $y_k$ in round $k$, followed by the move $o_k$ of player 2.(**) So a game with $n$ rounds consists of a sequence of alternating moves $y_1o_1y_2o_2...y_no_n$. At the end of the game in cycle $n$ the game or final board situation is evaluated with $V(y_1o_1...y_no_n)$. Player 1 tries to maximize $V$, whereas player 2 tries to minimize $V$. In the simplest case, $V$ is 1 if player 1 won the game, $V = -1$ if player 2 won and $V = 0$ for a draw. We assume a fixed game length $n$ independent of the actual move sequence. For games with variable length but maximal possible number of moves $n$, we could add dummy moves and pad the length to $n$. The optimal strategy (Nash equilibrium) of both players is a minimax strategy

$$\dot o_k \;=\; \arg\min_{o_k}\max_{y_{k+1}}\min_{o_{k+1}}...\max_{y_n}\min_{o_n} V(\dot y_1\dot o_1...\dot y_ko_k...y_no_n), \qquad (32)$$

$$\dot y_k \;=\; \arg\max_{y_k}\min_{o_k}...\max_{y_n}\min_{o_n} V(\dot y_1\dot o_1...\dot y_{k-1}\dot o_{k-1}y_ko_k...y_no_n). \qquad (33)$$

(*) In game theory, games like chess are often called 'extensive', whereas 'strategic' is reserved for a different kind of game.

(**) We anticipate notationally the later identification of the moves of player 1/2 with the actions/observations in the AI models.

But note that the minimax strategy is only optimal if both players behave rationally. If, for instance, player 2 has limited capabilities or makes errors and player 1 is able to discover these (through past moves), he could exploit these weaknesses and improve his performance by deviating from the minimax strategy. At least the classical game theory of Nash equilibria does not take into account limited rationality, whereas the AI$\xi$ agent should.

Using the AI$\mu$ model for game playing. In the following, we demonstrate the applicability of the AI model to games. The AI$\mu$ model takes the position of player 1. The environment provides the evaluation $V$. For a symmetric situation we could take a second AI$\mu$ model as player 2, but for simplicity we take the environment as the second player and assume that this environmental player behaves according to the minimax strategy (32). The environment serves as a perfect player and as a teacher, albeit a very crude one, as it tells the agent at the end of the game only whether it won or lost.

The minimax behavior of player 2 can be expressed by a (deterministic) probability distribution $\mu^{SG}$ as the following:

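$$\mu^{SG}(y_1\underline{o_1}...y_n\underline{o_n}) \;:=\; \begin{cases} 1 & \text{if } o_k = \arg\min_{o'_k}...\max_{y'_n}\min_{o'_n} V(y_1o_1...y_ko'_k...y'_no'_n)\ \ \forall k \\ 0 & \text{otherwise} \end{cases} \qquad (34)$$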

The probability that player 2 makes move $o_k$ is $\mu^{SG}(\dot y_1\dot o_1...\dot y_k\underline{o_k})$, which is 1 for $o_k = \dot o_k$ as defined in (32) and 0 otherwise.
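The minimax recursion of (32) and (33) can be sketched for a small game tree. This is a minimal illustration, assuming a made-up two-round game with binary moves and a hypothetical evaluation `value` that depends on the sum of all moves modulo 3. The function names `minimax` and `best_move` are chosen here for the example.

```python
def minimax(history, n, moves, value):
    """Game value after `history` if both players continue optimally.
    Player 1 (maximizer) moves at even positions, player 2 (minimizer) at odd."""
    if len(history) == 2 * n:
        return value(history)
    outcomes = [minimax(history + (m,), n, moves, value) for m in moves]
    return max(outcomes) if len(history) % 2 == 0 else min(outcomes)

def best_move(history, n, moves, value):
    """Optimal move of the player whose turn it is (argmax or argmin)."""
    score = lambda m: minimax(history + (m,), n, moves, value)
    pick = max if len(history) % 2 == 0 else min
    return pick(moves, key=score)

def value(history):
    """Assumed toy evaluation V: win (+1), loss (-1) or draw (0) for player 1."""
    return {0: 1, 1: -1, 2: 0}[sum(history) % 3]

n, moves = 2, (0, 1)
print(minimax((), n, moves, value))    # value of the whole game
print(best_move((), n, moves, value))  # first move of player 1
```

Player 2's deterministic distribution in (34) corresponds to putting probability 1 on `best_move` at every odd position of the history.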

Clearly, the AI$\mu$ system receives no feedback, i.e. $r_1 = ... = r_{n-1} = 0$, until the end of the game, where it should receive positive/negative/neutral feedback on a win/loss/draw, i.e. $r_n = V(...)$. The environmental prior probability is therefore

$$\mu^{AI}(y_1\underline{x_1}...y_n\underline{x_n}) \;=\; \begin{cases} \mu^{SG}(y_1\underline{o_1}...y_n\underline{o_n}) & \text{if } r_1...r_{n-1} = 0 \text{ and } r_n = V(y_1o_1...y_no_n) \\ 0 & \text{otherwise} \end{cases} \qquad (35)$$

where $x_i = r_io_i$. If the environment is a minimax player (32) plus a crude teacher $V$, i.e. if $\mu^{AI}$ is the true prior probability, the question now is, what is the behavior $\dot y_k^{AI}$ of the AI$\mu$ agent. It turns out that if we set $m_k = n$ the AI$\mu$ agent is also a minimax player (33) and hence optimal ($\dot y_k^{AI} = \dot y_k^{SG}$, see [Hut00, Hut04] for a formal proof). Playing a sequence of games is a special case of a factorizable $\mu$ described in Section [2n with identical factors $\mu_r$ for all $r$ and equal episode lengths $n_{r+1} - n_r = n$. Hence, in a minimax environment AI$\mu$ behaves itself as a minimax strategy,

$$\dot y_k^{AI} \;=\; \arg\max_{y_k}\min_{o_k}...\max_{y_{(r+1)n}}\min_{o_{(r+1)n}} V(\dot y\dot o_{rn+1:k-1}...yo_{k:(r+1)n}) \qquad (36)$$

with $r$ such that $rn < k \le (r+1)n$ and for any choice of $m_k$ as long as the horizon $h_k \ge n$.

Using the AI$\xi$ Model for Game Playing. When going from the specific AI$\mu$ model, where the rules of the game are explicitly modeled into the prior probability $\mu^{AI}$, to the universal model AI$\xi$, we have to ask whether these rules can be learned from the assigned rewards $r_k$. Here, the main reason for studying the case of repeated games rather than just one game arises. For a single game there is only one cycle of nontrivial feedback, namely the end of the game, which is too late to be useful except when further games follow.

We expect that no other learning scheme (with no extra information) can learn the game more quickly than AI$\xi$, since $\mu^{AI}$ factorizes in the case of games of fixed length, i.e. $\mu^{AI}$ satisfies a strong separability condition. In the case of variable game length the entanglement is also low. $\mu^{AI}$ should still be sufficiently separable, allowing us to formulate and prove good reward bounds for AI$\xi$. A qualitative argument goes as follows:

Since initially, AI$\xi$ loses all games, it tries to draw out a loss as long as possible, without having ever experienced or even knowing what it means to win. Initially, AI$\xi$ will make a lot of illegal moves. If illegal moves abort the game resulting in (non-delayed) negative reward (loss), AI$\xi$ can quickly learn the typically simple rules concerning legal moves, which usually constitute most of the rules; just the goal rule is missing. After having learned the move-rules, AI$\xi$ learns the (negatively rewarded) losing positions, the positions leading to losing positions, etc., so it can try to draw out losing games. For instance, in chess, avoiding being checkmated for 20, 30, 40 moves against a master is already quite an achievement. At this ability stage, AI$\xi$ should be able to win some games by luck, or speculate about a symmetry in the game that checkmating the opponent will be positively rewarded. Once having found out the complete rules (moves and goal), AI$\xi$ will right away reason that playing minimax is best, and henceforth beat all grandmasters.

If a (complex) game cannot be learned in this way in a realistic number of cycles, one has to provide more feedback. This could be achieved by intermediate help during the game. The environment could give positive (negative) feedback for every good (bad) move the agent makes. The threshold for rating a move as good should be adapted to the experience the agent has gained, in such a way that approximately the better half of the moves are rated as good and the other half as bad, in order to maximize the information content of the feedback.
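The reason for the half/half split is that a binary signal carries the most information when both outcomes are equally likely. A minimal sketch of this fact, using the standard Shannon entropy of a binary variable (the function name `feedback_entropy` is chosen here for illustration):

```python
import math

def feedback_entropy(p_good):
    """Shannon entropy in bits of a binary good/bad signal with P(good) = p_good."""
    if p_good in (0.0, 1.0):
        return 0.0
    return -(p_good * math.log2(p_good) + (1 - p_good) * math.log2(1 - p_good))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(feedback_entropy(p), 3))
```

A teacher who rates 90% of the moves as bad therefore conveys less than half a bit per move, whereas a balanced rating conveys a full bit.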

For more complicated games like chess, even more feedback may be necessary from a practical point of view. One way to increase the feedback far beyond a few bits per cycle is to train the agent by teaching it good moves. This is called supervised learning. Despite the fact that the AI$\mu$ model has only a reward feedback $r_k$, it is able to learn supervised, as will be shown in Section 5.4. Another way would be to start with simpler games containing certain aspects of the true game and to switch to the true game when the agent has learned the simple game.

No other difficulties are expected when going from $\mu$ to $\xi$. Eventually $\xi^{AI}$ will converge to the minimax strategy $\mu^{AI}$. In the more realistic case, where the environment is not a perfect minimax player, AI$\xi$ can detect and exploit the weakness of the opponent.

Finally, we want to comment on the input/output space $\mathcal X/\mathcal Y$ of the AI models. In practical applications, $\mathcal Y$ will possibly include also illegal moves. If $\mathcal Y$ is the set of moves of, e.g. a robotic arm, the agent could move a wrong figure or even knock over the figures. A simple way to handle illegal moves $y_k$ is by interpreting them as losing moves, which terminate the game. Further, if, e.g. the input $x_k$ is the image of a video camera which makes one shot per move, $\mathcal X$ is not the set of moves by the environment but includes the set of states of the game board. The discussion in this section handles this case as well. There is no need to explicitly design the system's I/O space $\mathcal X/\mathcal Y$ for a specific game.

The discussion above on the AI$\xi$ agent was rather informal for the following reason: game playing (the SG$\xi$ agent) has (nearly) the same complexity as fully general AI, and quantitative results for the AI$\xi$ agent are difficult (but not impossible) to obtain.

5.3 Function Minimization (FM)

Applications/examples. There are many problems that can be reduced to function minimization (FM) problems. The minimum of a (real-valued) function $f: \mathcal Y \to \mathbb R$ over some domain $\mathcal Y$ or a good approximate to the minimum has to be found, usually with some limited resources.

One popular example is the traveling salesman problem (TSP). $\mathcal Y$ is the set of different routes between towns, and $f(y)$ the length of route $y \in \mathcal Y$. The task is to find a route of minimal length visiting all cities. This problem is NP-hard. Getting good approximations in limited time is of great importance in various applications. Another example is the minimization of production costs (MPC), e.g. of a car, under several constraints. $\mathcal Y$ is the set of all alternative car designs and production methods compatible with the specifications and $f(y)$ the overall cost of alternative $y \in \mathcal Y$. A related example is finding materials or (bio)molecules with certain properties (MAT), e.g. solids with minimal electrical resistance or maximally efficient chlorophyll modifications, or aromatic molecules that taste as close as possible to strawberry. We can also ask for nice paintings (NPT). $\mathcal Y$ is the set of all existing or imaginable paintings, and $f(y)$ characterizes how much person A likes painting $y$. The agent should present paintings which A likes.

For now, these are enough examples. The TSP is very rigorous from a mathematical point of view, as $f$, i.e. an algorithm of $f$, is usually known. In principle, the minimum could be found by exhaustive search, were it not for computational resource limitations. For MPC, $f$ can often be modeled in a reliable and sufficiently accurate way. For MAT you need very accurate physical models, which might be unavailable or too difficult to solve or implement. For NPT all we have is the judgement of person A on every presented painting. The evaluation function $f$ cannot be implemented without scanning A's brain, which is not possible with today's technology.

So there are different limitations, some depending on the application we have in mind. An implementation of $f$ might not be available, $f$ can only be tested at some arguments $y$ and $f(y)$ is determined by the environment. We want to (approximately) minimize $f$ with as few function calls as possible or, conversely, find an as close as possible approximation for the minimum within a fixed number of function evaluations. If $f$ is available or can quickly be inferred by the agent and evaluation is quick, it is more important to minimize the total time needed to imagine new trial minimum candidates plus the evaluation time for $f$. As we do not consider computational aspects of AI$\xi$ till Section 6 we concentrate on the first case, where $f$ is not available or dominates the computational requirements.

The greedy model. The FM model consists of a sequence $y_1z_1y_2z_2...$ where $y_k$ is a trial of the FM agent for a minimum of $f$ and $z_k = f(y_k)$ is the true function value returned by the environment. We randomize the model by assuming a probability distribution $\mu(f)$ over the functions. There are several reasons for doing this. We might really not know the exact function $f$, as in the NPT example, and model our uncertainty by the probability distribution $\mu$. What is more important, we want to parallel the other AI classes, like in the SP$\mu$ model, where we always started with a probability distribution $\mu$ that was finally replaced by $\xi$ to get the universal Solomonoff prediction SP$\xi$. We want to do the same thing here. Further, the probabilistic case includes the deterministic case by choosing $\mu(f) = \delta_{ff_0}$, where $f_0$ is the true function. A final reason is that the deterministic case is trivial when $\mu$ and hence $f_0$ are known, as the agent can internally (virtually) check all function arguments and output the correct minimum from the very beginning.

We assume that $\mathcal Y$ is countable and that $\mu$ is a discrete measure, e.g. by taking only computable functions. The probability that the function values of $y_1,...,y_n$ are $z_1,...,z_n$ is then given by

$$\mu^{FM}(y_1\underline{z_1}...y_n\underline{z_n}) \;:=\; \sum_{f:\,f(y_i)=z_i\ \forall 1\le i\le n} \mu(f) \qquad (37)$$

We start with a model that minimizes the expectation of the function value $f$ for the next output $y_k$, taking into account previous information:

$$\dot y_k \;:=\; \arg\min_{y_k}\sum_{z_k} z_k\cdot\mu(\dot y_1\dot z_1...\dot y_{k-1}\dot z_{k-1}y_k\underline{z_k})$$

This type of greedy algorithm, just minimizing the next feedback, was sufficient for sequence prediction (SP) and is also sufficient for classification (CF, not described here). It is, however, not sufficient for function minimization as the following example demonstrates.

Take $f: \{0,1\} \to \{1,2,3,4\}$. There are 16 different functions which shall be equiprobable, $\mu(f) = \frac{1}{16}$. The function expectation in the first cycle

$$\langle z_1\rangle \;:=\; \sum_{z_1} z_1\cdot\mu(y_1\underline{z_1}) \;=\; \frac14\sum_{z_1} z_1 \;=\; \frac14(1+2+3+4) \;=\; 2.5$$

is just the arithmetic average of the possible function values and is independent of $y_1$. Therefore, $\dot y_1 = 0$, if we define argmin to take the lexicographically first minimum in an ambiguous case like here. Let us assume that $f_0(0) = 2$, where $f_0$ is the true environment function, i.e. $\dot z_1 = 2$. The expectation of $z_2$ is then

$$\langle z_2\rangle \;:=\; \sum_{z_2} z_2\cdot\mu(02\,y_2\underline{z_2}) \;=\; \begin{cases} 2 & \text{for } y_2 = 0 \\ 2.5 & \text{for } y_2 = 1 \end{cases}$$

For $y_2 = 0$ the agent already knows $f(0) = 2$, for $y_2 = 1$ the expectation is, again, the arithmetic average. The agent will again output $\dot y_2 = 0$ with feedback $\dot z_2 = 2$. This will continue forever. The agent is not motivated to explore other $y$'s as $f(0)$ is already smaller than the expectation of $f(1)$. This is obviously not what we want. The greedy model fails. The agent ought to be inventive and try other outputs when given enough time.

The general reason for the failure of the greedy approach is that the information contained in the feedback $z_k$ depends on the output $y_k$. A FM agent can actively influence the knowledge it receives from the environment by the choice in $y_k$. It may be more advantageous to first collect certain knowledge about $f$ by an (in greedy sense) nonoptimal choice for $y_k$, rather than to minimize the $z_k$ expectation immediately. The nonminimality of $z_k$ might be overcompensated in the long run by exploiting this knowledge. In SP, the received information is always the current bit of the sequence, independent of what SP predicts for this bit. This is why a greedy strategy in the SP case is already optimal.
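The two-point example can be checked by brute force. The sketch below enumerates the 16 equiprobable functions, computes the posterior expectation used by the greedy rule, and runs the greedy agent against a true function with $f_0(0) = 2$. The value $f_0(1) = 1$ and the names `expected_value` and `greedy_run` are assumptions made for this illustration.

```python
from itertools import product

DOMAIN = (0, 1)
VALUES = (1, 2, 3, 4)
# All 16 functions f: {0,1} -> {1,2,3,4}, each with prior probability 1/16.
FUNCTIONS = [dict(zip(DOMAIN, vals)) for vals in product(VALUES, repeat=len(DOMAIN))]

def expected_value(history, y):
    """Posterior expectation of f(y) given the observed (y_i, z_i) pairs."""
    consistent = [f for f in FUNCTIONS if all(f[yi] == zi for yi, zi in history)]
    return sum(f[y] for f in consistent) / len(consistent)

def greedy_run(f0, cycles):
    """Greedy agent: always query the argument with the smallest expected value.
    min() returns the first minimizer, i.e. the lexicographically first one."""
    history = []
    for _ in range(cycles):
        y = min(DOMAIN, key=lambda cand: expected_value(history, cand))
        history.append((y, f0[y]))
    return history

f0 = {0: 2, 1: 1}          # assumed true function; its minimum is at y = 1
print(greedy_run(f0, 5))   # the agent never tries y = 1
```

Although the true minimum lies at $y = 1$ in this instance, the greedy agent keeps returning $y = 0$, because 2 is below the prior expectation 2.5 of the untested argument.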

The general FM$\mu/\xi$ model. To get a useful model we have to think more carefully about what we really want. Should the FM agent output a good minimum in the last output in a limited number of cycles $m$, or should the average of the $z_1,...,z_m$ values be minimal, or does it suffice that just one of the $z$ is as small as possible? The subtle and important differences between these settings have been analyzed and discussed in detail in [Hut00, Hut04]. In the following we concentrate on minimizing the average, or equivalently the sum of function values. We define the FM$\mu$ model as to minimize the sum $z_1 + ... + z_m$. Building the $\mu$ average by summation over the $z_i$ and minimizing w.r.t. the $y_i$ has to be performed in the correct chronological order. With a similar reasoning as in (7) to (11) we get

$$\dot y_k^{FM} \;=\; \arg\min_{y_k}\sum_{z_k}...\min_{y_m}\sum_{z_m}(z_1 + ... + z_m)\cdot\mu(\dot y_1\dot z_1...\dot y_{k-1}\dot z_{k-1}y_k\underline{z_k}...y_m\underline{z_m}) \qquad (38)$$
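The alternating minimization and expectation in (38) can be sketched as a recursion on the same toy problem of functions $f: \{0,1\} \to \{1,2,3,4\}$ with a uniform prior. This is an illustration under these assumptions only. The names `q_value`, `expected_cost` and `best_action` are chosen here, and `remaining` counts the cycles still to be played.

```python
from functools import lru_cache
from itertools import product

DOMAIN = (0, 1)
VALUES = (1, 2, 3, 4)
FUNCTIONS = [dict(zip(DOMAIN, vals)) for vals in product(VALUES, repeat=len(DOMAIN))]

def q_value(history, y, remaining):
    """Expected sum of the next `remaining` function values if the agent
    queries y now and chooses optimally afterwards."""
    consistent = [f for f in FUNCTIONS if all(f[yi] == zi for yi, zi in history)]
    total = 0.0
    for f in consistent:
        z = f[y]
        total += z + expected_cost(history + ((y, z),), remaining - 1)
    return total / len(consistent)

@lru_cache(maxsize=None)
def expected_cost(history, remaining):
    """Minimum over the next query of the expected remaining sum."""
    if remaining == 0:
        return 0.0
    return min(q_value(history, y, remaining) for y in DOMAIN)

def best_action(history, remaining):
    """First minimizer of the expected remaining sum."""
    return min(DOMAIN, key=lambda y: q_value(history, y, remaining))

known = ((0, 2),)  # the agent has already observed f(0) = 2
for remaining in (1, 2, 3, 4, 5):
    print(remaining, best_action(known, remaining))
```

With one cycle left the recursion reduces to the greedy rule and stays at $y = 0$. With four or more cycles left, testing $y = 1$ has the lower expected sum, so the agent explores.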

By construction, the FM$\mu$ model guarantees optimal results in the usual sense that no other model knowing only $\mu$ can be expected to produce better results. The interesting case (in AI) is when $\mu$ is unknown. We define for this case, the FM$\xi$ model by replacing $\mu(f)$ with some $\xi(f)$, which should assign high probability to functions $f$ of low complexity. So we might define $\xi(f) = \sum_{q:\forall x[U(qx)=f(x)]} 2^{-\ell(q)}$. The problem with this definition is that it is, in general, undecidable whether a TM $q$ is an implementation of a function $f$. $\xi(f)$ defined in this way is uncomputable, not even approximable. As we only need a $\xi$ analogous to the l.h.s. of (37), the following definition is natural

$$\xi^{FM}(y_1\underline{z_1}...y_n\underline{z_n}) \;:=\; \sum_{q:\,q(y_i)=z_i\ \forall 1\le i\le n} 2^{-\ell(q)} \qquad (39)$$

$\xi^{FM}$ is actually equivalent to inserting the uncomputable $\xi(f)$ into (37). One can show that $\xi^{FM}$ is an enumerable semimeasure and dominates all enumerable probability distributions of the form (37).

Alternatively, we could have constrained the sum in (39) by $q(y_1...y_n) = z_1...z_n$ analogous to (21), but these two definitions are not equivalent. Definition (39) ensures the symmetry(*) in its arguments and $\xi^{FM}(...y\underline{z}...y\underline{z'}...) = 0$ for $z \ne z'$. It incorporates all general knowledge we have about function minimization, whereas (21) does not. But this extra knowledge has only low information content (complexity of $O(1)$), so we do not expect FM$\xi$ to perform much worse when using (21) instead of (39). But there is no reason to deviate from (39) at this point.

(*) See [Sol99] for a discussion on symmetric universal distributions on unordered data.

We can now define a loss $L^{FM\mu}$ as (38) with $k = 1$ and $\arg\min_{y_1}$ replaced by $\min_{y_1}$ and, additionally, $\mu$ replaced by $\xi$ for $L^{FM\xi}$. We expect $|L^{FM\xi} - L^{FM\mu}|$ to be bounded in a way that justifies the use of $\xi$ instead of $\mu$ for computable $\mu$, i.e. computable $f_0$ in the deterministic case. The arguments are the same as for the AI$\xi$ model.

In [Hut00, Hut04] it has been proven that FM$\xi$ is inventive in the sense that it never ceases searching for minima, but will test all $y \in \mathcal Y$ if $\mathcal Y$ is finite (and an infinite set of different $y$'s if $\mathcal Y$ is infinite) for sufficiently large horizon $m$. There are currently no rigorous results on the quality of the guesses, but for the FM$\mu$ agent the guesses are optimal by definition. If $K(\mu)$ for the true distribution $\mu$ is finite, we expect the FM$\xi$ agent to solve the 'exploration versus exploitation' problem in a universally optimal way, as $\xi$ converges rapidly to $\mu$.

Using the AI Models for Function Minimization. The AI models can be used for function minimization in the following way. The output $y_k$ of cycle $k$ is a guess for a minimum of $f$, like in the FM model. The reward $r_k$ should be high for small function values $z_k = f(y_k)$. The choice $r_k = -z_k$ for the reward is natural. Here, the feedback is not binary but $r_k \in \mathcal R \subset \mathbb R$, with $\mathcal R$ being a countable subset of $\mathbb R$, e.g. the computable reals or all rational numbers. The feedback $o_k$ should be the function value $f(y_k)$. As this is already provided in the rewards $r_k$ we could set $o_k = \epsilon$ as in Section 5.1. For a change and to see that the choice really does not matter we set $o_k = z_k$ here. The AI$\mu$ prior probability is

$$\mu^{AI}(y_1\underline{x_1}...y_n\underline{x_n}) \;=\; \begin{cases} \mu^{FM}(y_1\underline{z_1}...y_n\underline{z_n}) & \text{for } r_k = -z_k,\ o_k = z_k,\ x_k = r_ko_k \\ 0 & \text{else} \end{cases} \qquad (40)$$

Inserting this into (10) with $m_k = m$ one can show that $\dot y_k^{AI} = \dot y_k^{FM}$, where $\dot y_k^{FM}$ has been defined in (38). The proof is very simple since the FM model has already a rather general structure, which is similar to the full AI model.

We expect no problem in going from FM$\xi$ to AI$\xi$. The only thing the AI$\xi$ model has to learn is to ignore the $o$ feedbacks as all information is already contained in $r$. This task is simple as every cycle provides one data point for a simple function to learn.

Remark on TSP. The Traveling Salesman Problem (TSP) seems to be trivial in the AI$\mu$ model but nontrivial in the AI$\xi$ model, because (38) just implements an internal complete search, as $\mu(f) = \delta_{ff^{TSP}}$ contains all necessary information. AI$\mu$ outputs, from the very beginning, the exact minimum of $f^{TSP}$. This "solution" is, of course, unacceptable from a performance perspective. As long as we give no efficient approximation $\xi^c$ of $\xi$, we have not contributed anything to a solution of the TSP by using AI$\xi^c$. The same is true for any other problem where $f$ is computable and easily accessible. Therefore, TSP is not (yet) a good example because all we have done is to replace an NP-complete problem with the uncomputable AI$\xi$ model or by a computable AI$\xi^c$ model, for which we have said nothing about computation time yet. It is simply an overkill to reduce simple problems to AI$\xi$. TSP is a simple problem in this respect, until we consider the AI$\xi^c$ model seriously. For the other examples, where $f$ is inaccessible or complicated, an AI$\xi^c$ model would provide a true solution to the minimization problem as an explicit definition of $f$ is not needed for AI$\xi$ and AI$\xi^c$. A computable version of AI$\xi$ will be defined in Section 6.
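To make the remark about internal complete search concrete, here is a minimal sketch of exhaustive TSP minimization over all routes. It is a plain brute-force search, shown only to illustrate why a known $f$ makes the problem trivial in principle and intractable in practice. The city coordinates and the names `route_length` and `exhaustive_tsp` are assumptions for this example.

```python
import math
from itertools import permutations

def route_length(route, coords):
    """Length of the closed tour that visits the cities in `route` in order."""
    return sum(math.dist(coords[a], coords[b])
               for a, b in zip(route, route[1:] + route[:1]))

def exhaustive_tsp(coords):
    """Return a shortest closed tour by checking every ordering of the cities.
    The first city is fixed, so (n-1)! tours are evaluated."""
    first, *rest = range(len(coords))
    tours = ([first] + list(p) for p in permutations(rest))
    return min(tours, key=lambda r: route_length(r, coords))

coords = [(0, 0), (0, 1), (1, 1), (1, 0)]  # assumed example: corners of a unit square
best = exhaustive_tsp(coords)
print(best, route_length(best, coords))
```

The factorial growth of the number of tours is the performance problem referred to above. The search itself needs no learning at all.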

5.4 Supervised Learning from Examples (EX) 

The developed AI models provide a frame for reinforcement learning. The environment provides feedback $r$, informing the agent about the quality of its last (or earlier) output $y$; it assigns reward $r$ to output $y$. In this sense, reinforcement learning is explicitly integrated into the AI$\mu/\xi$ models. AI$\mu$ maximizes the true expected reward, whereas the AI$\xi$ model is a universal, environment-independent reinforcement learning algorithm.

There is another type of learning method: Supervised learning by presentation of examples (EX). Many problems learned by this method are association problems of the following type. Given some examples $o \in R \subset \mathcal O$, the agent should reconstruct, from a partially given $o'$, the missing or corrupted parts, i.e. complete $o'$ to $o$ such that relation $R$ contains $o$. In many cases, $\mathcal O$ consists of pairs $(z, v)$, where $v$ is the possibly missing part.

Applications/examples. Learning functions by presenting $(z, f(z))$ pairs and asking for the function value of $z$ by presenting $(z, ?)$ falls into the category of supervised learning from examples, e.g. $f(z)$ may be the class label or category of $z$.

A basic example is learning properties of geometrical objects coded in some way. For instance, if there are 18 different objects characterized by their size (small or big), their colors (red, green, or blue) and their shapes (square, triangle, or circle), then (object, property) $\in R$ if the object possesses the property. Here, $R$ is a relation that is not the graph of a single-valued function.

When teaching a child by pointing to objects and saying "this is a tree" or "look how green" or "how beautiful", one establishes a relation of (object, property) pairs in $R$. Pointing to a (possibly different) tree later and asking "what is this?" corresponds to a partially given pair (object, ?), where the missing part "?" should be completed by the child saying "tree".

A final example we want to give is chess. We have seen that, in principle, chess can be learned by reinforcement learning. In the extreme case the environment only provides reward $r = 1$ when the agent wins. The learning rate is probably unacceptable from a practical point of view, due to the low amount of information feedback. A more practical method of teaching chess is to present example games in the form of sensible (board-state, move) sequences. They contain information about legal and good moves (but without any explanation). After several games have been presented, the teacher could ask the agent to make its own move by presenting (board-state, ?) and then evaluate the answer of the agent.

Supervised learning with the AI$\mu/\xi$ model. Let us define the EX model as follows: The environment presents inputs $o_{k-1} = z_kv_k \equiv (z_k, v_k) \in R \cup (\mathcal Z \times \{?\}) \subset \mathcal Z \times (\mathcal Y \cup \{?\}) = \mathcal O$ to the agent in cycle $k-1$. The agent is expected to output $y_k$ in the next cycle, which is evaluated with $r_k = 1$ if $(z_k, y_k) \in R$ and 0 otherwise. To simplify the discussion, an output $y_k$ is expected and evaluated even when $v_k\,(\ne ?)$ is given. To complete the description of the environment, the probability distribution $\mu_R(\underline{o_1...o_n})$ of the examples and questions $o_i$ (depending on $R$) has to be given. Wrong examples should not occur, i.e. $\mu_R$ should be 0 if $o_i \notin R \cup (\mathcal Z \times \{?\})$ for some $1 \le i \le n$. The relations $R$ might also be probability distributed with $\sigma(\underline{R})$. The example prior probability in this case is

$$\mu(\underline{o_1...o_n}) \;=\; \sum_R \mu_R(\underline{o_1...o_n})\cdot\sigma(\underline{R}) \qquad (41)$$

The knowledge of the valuation $r_k$ on output $y_k$ restricts the possible relations $R$, consistent with $R(z_k, y_k) = r_k$, where $R(z, y) := 1$ if $(z, y) \in R$ and 0 otherwise. The prior probability for the input sequence $x_1...x_n$ if the output sequence of AI$\mu$ is $y_1...y_n$, is therefore

$$\mu^{AI}(y_1\underline{x_1}...y_n\underline{x_n}) \;=\; \sum_{R:\,\forall 1<i\le n[R(z_i,y_i)=r_i]} \mu_R(\underline{o_1...o_n})\cdot\sigma(\underline{R})$$

where $x_i = r_io_i$ and $o_{i-1} = z_iv_i$ with $v_i \in \mathcal Y \cup \{?\}$. In the I/O sequence $y_1x_1y_2x_2... = y_1r_1z_2v_2y_2r_2z_3v_3...$ the $y_1r_1$ are dummies, after that regular behavior starts with example $(z_2, v_2)$.
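The interaction protocol of the EX model can be sketched as a simple loop. This is an illustration of the protocol only: `MemorizingAgent` is a hypothetical hand-coded baseline, not the AI$\mu$ or AI$\xi$ agent, and the relation and input stream are made up for the example. For brevity the loop pairs each input $(z_k, v_k)$ directly with the answer $y_k$, instead of spreading them over two consecutive cycles.

```python
def ex_episode(relation, stream, agent):
    """Run the EX protocol. Each input is an example pair (z, v) from the
    relation or a question (z, '?'); every output y earns reward 1 if
    (z, y) is in the relation and 0 otherwise."""
    rewards = []
    for z, v in stream:
        y = agent.act(z, v)
        r = 1 if (z, y) in relation else 0
        agent.learn(z, v, r)
        rewards.append(r)
    return rewards

class MemorizingAgent:
    """Hypothetical baseline: echo a shown answer, otherwise recall a stored one."""
    def __init__(self, default):
        self.memory = {}
        self.default = default

    def act(self, z, v):
        if v != "?":
            return v
        return self.memory.get(z, self.default)

    def learn(self, z, v, r):
        # Only examples carry supervisor information; the reward is ignored here.
        if v != "?":
            self.memory[z] = v

relation = {("tree", "green"), ("tree", "tall"), ("sky", "blue")}
stream = [("tree", "green"), ("sky", "blue"), ("tree", "?"), ("sky", "?"), ("rock", "?")]
rewards = ex_episode(relation, stream, MemorizingAgent(default="unknown"))
print(rewards)
```

The baseline is told in advance how to use the examples. The point of the discussion that follows is that the universal agent has to discover this use by itself, guided only by the rewards.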

The AI$\mu$ model is optimal by construction of $\mu^{AI}$. For computable prior $\mu_R$ and $\sigma$, we expect a near-optimal behavior of the universal AI$\xi$ model if $\mu_R$ additionally satisfies some separability property. In the following, we give some motivation why the AI$\xi$ model takes into account the supervisor information contained in the examples and why it learns faster than by reinforcement.

We keep $R$ fixed and assume $\mu_R(o_1...o_n) = \mu_R(o_1)\cdot...\cdot\mu_R(o_n) \ne 0 \Leftrightarrow o_i \in R \cup (\mathcal Z \times \{?\})\ \forall i$ to simplify the discussion. Short codes $q$ contribute most to $\xi^{AI}(y_1\underline{x_1}...y_n\underline{x_n})$. As $o_1...o_n$ is distributed according to the computable probability distribution $\mu_R$, a short code of $o_1...o_n$ for large enough $n$ is a Huffman code w.r.t. the distribution $\mu_R$. So we expect $\mu_R$ and hence $R$ to be coded in the dominant contributions to $\xi^{AI}$ in some way, where the plausible assumption was made that the $y$ on the input tape do not matter. Much more than one bit per cycle will usually be learned, i.e. relation $R$ will be learned in $n \ll K(R)$ cycles by appropriate examples. This coding of $R$ in $q$ evolves independently of the feedbacks $r$. To maximize the feedback $r_k$, the agent has to learn to output a $y_k$ with $(z_k, y_k) \in R$. The agent has to invent a program extension $q'$ to $q$, which extracts $z_k$ from $o_{k-1} = (z_k, ?)$ and searches for and outputs a $y_k$ with $(z_k, y_k) \in R$. As $R$ is already coded in $q$, $q'$ can reuse this coding of $R$ in $q$. The size of the extension $q'$ is, therefore, of order 1. To learn this $q'$, the agent requires feedback $r$ with information content $O(1) = K(q')$ only.

Let us compare this with reinforcement learning, where only $o_{k-1} = (z_k, ?)$ pairs are presented. A coding of $R$ in a short code $q$ for $o_1...o_n$ is of no use and will therefore be absent. Only the rewards $r$ force the agent to learn $R$. $q'$ is therefore expected to be of size $K(R)$. The information content in the $r$'s must be of the order $K(R)$. In practice, there are often only very few $r_k = 1$ at the beginning of the learning phase, and the information content in $r_1...r_n$ is much less than $n$ bits. The required number of cycles to learn $R$ by reinforcement is, therefore, at least but in many cases much larger than $K(R)$.

Although AI$\xi$ was never designed or told to learn supervised, it learns how to take advantage of the examples from the supervisor. $\mu_R$ and $R$ are learned from the examples; the rewards $r$ are not necessary for this process. The remaining task of learning how to learn supervised is then a simple task of complexity $O(1)$, for which the rewards $r$ are necessary.

5.5 Other Aspects of Intelligence 

In AI, a variety of general ideas and methods have been developed. In the previous subsections, we saw how several problem classes can be formulated within AI$\xi$. As we claim universality of the AI$\xi$ model, we want to illuminate which of the other AI methods are incorporated in the AI$\xi$ model, and how, by looking at its structure. Some methods are directly included, while others are or should be emergent. We do not claim the following list to be complete.

Probability theory and utility theory are the heart of the AI$\mu/\xi$ models. The probability $\xi$ is a universal belief about the true environmental behavior $\mu$. The utility function is the total expected reward, called value, which should be maximized. Maximization of an expected utility function in a probabilistic environment is usually called sequential decision theory, and is explicitly integrated in full generality in our model. In a sense this includes probabilistic (a generalization of deterministic) reasoning, where the objects of reasoning are not true and false statements, but the prediction of the environmental behavior. Reinforcement Learning is explicitly built in, due to the rewards. Supervised learning is an emergent phenomenon (Section 5.4). Algorithmic information theory leads us to use $\xi$ as a universal estimate for the prior probability $\mu$.

For horizon $>1$, the expectimax series in (10) and the process of selecting maximal
values may be interpreted as abstract planning. The expectimax series is a form
of informed search, in the case of AI$\mu$, and heuristic search, for AI$\xi$, where $\xi$ could
be interpreted as a heuristic for $\mu$. The minimax strategy of game playing in case
of AI$\mu$ is also subsumed. The AI$\xi$ model converges to the minimax strategy if the
environment is a minimax player, but it can also take advantage of environmental
players with limited rationality. Problem solving occurs (only) in the form of how
to maximize the expected future reward.

Knowledge is accumulated by AI$\xi$ and is stored in some form not specified further
on the work tape. Any kind of information in any representation on the inputs $y$ is
exploited. The problem of knowledge engineering and representation appears in the
form of how to train the AI$\xi$ model. More practical aspects, like language or image
processing, have to be learned by AI$\xi$ from scratch.

Other theories, like fuzzy logic, possibility theory, Dempster-Shafer theory, ... are
partly outdated and partly reducible to Bayesian probability theory [Che85, Che88].
The interpretation and consequences of the evidence gap
$g := 1-\sum_{x_k}\xi(yx_{<k}y\underline{x}_k) > 0$
in $\xi$ may be similar to those in Dempster-Shafer theory. Boolean logical reasoning
about the external world plays, at best, an emergent role in the AI$\xi$ model.

Other methods that do not seem to be contained in the AI$\xi$ model might also
be emergent phenomena. The AI$\xi$ model has to construct short codes of the
environmental behavior, and AIXItl (see next section) has to construct short action
programs. If we analyzed and interpreted these programs for realistic environments,
we might find some of the unmentioned or unused or new AI methods at
work in these programs. This is, however, pure speculation at this point. More
important: when trying to make AI$\xi$ practically usable, some other AI methods,
like genetic algorithms or neural nets, especially for I/O pre/postprocessing, may
be useful.

The main thing we wanted to point out is that the AI$\xi$ model does not lack
any important known property of intelligence or known AI methodology. What 
is missing, however, are computational aspects, which are addressed in the next 
section. 

6 Time-Bounded AIXI Model 

Until now, we have not bothered with the non-computability of the universal
probability distribution $\xi$. As all universal models in this paper are based on $\xi$, they are
not effective in this form. In this section, we outline how the previous models and
results can be modified/generalized to the time-bounded case. Indeed, the situation
is not as bad as it could be. $\xi$ is enumerable and $\dot y_k$ is still approximable, i.e. there
exists an algorithm that will produce a sequence of outputs eventually converging
to the exact output $\dot y_k$, but we can never be sure whether we have already reached
it. Besides this, the convergence is extremely slow, so this type of asymptotic
computability is of no direct (practical) use, but will nevertheless be important later.

Let $\tilde p$ be a program that calculates within a reasonable time $\tilde t$ per cycle, a
reasonable intelligent output, i.e. $\tilde p(\dot x_{<k})=\dot y_{1:k}$. This sort of computability assumption,
that a general-purpose computer of sufficient power is able to behave in an
intelligent way, is the very basis of AI, justifying the hope to be able to construct agents
that eventually reach and outperform human intelligence. For a contrary viewpoint
see [Luc61, Pen89, Pen94]. It is not necessary to discuss here what is meant by
'reasonable time/intelligence' and 'sufficient power'. What we are interested in, in this
section, is whether there is a computable version AIXI$\tilde t$ of the AI$\xi$ agent that is
superior or equal to any $p$ with computation time per cycle of at most $\tilde t$. By 'superior',
we mean 'more intelligent', so what we need is an order relation for intelligence, like
the one in Definition 10.

The best result we could think of would be an AIXI$\tilde t$ with computation time
$\leq\tilde t$ at least as intelligent as any $p$ with computation time $\leq\tilde t$. If AI is possible at
all, we would have reached the final goal: the construction of the most intelligent
algorithm with computation time $\leq\tilde t$. Just as there is no universal measure in the
set of computable measures (within time $\tilde t$), neither may such an AIXI$\tilde t$ exist.

What we can realistically hope to construct is an AIXI$\tilde t$ agent of computation
time $c\cdot\tilde t$ per cycle for some constant $c$. The idea is to run all programs $p$ of length
$\leq\tilde l:=\ell(\tilde p)$ and time $\leq\tilde t$ per cycle and pick the best output. The total computation
time is $c\cdot\tilde t$ with $c=2^{\tilde l}$. This sort of idea of 'typing monkeys' with one of them
eventually writing Shakespeare, has been applied in various forms and contexts in
theoretical computer science. The realization of this best vote idea, in our case, is not
straightforward and will be outlined in this section. A related idea is that of basing
the decision on the majority of algorithms. This 'democratic vote' idea was used in
[LW94, Vov92] for sequence prediction, and is referred to as 'weighted majority'.

6.1 Time-Limited Probability Distributions 

In the literature one can find time-limited versions of Kolmogorov complexity [Dal73,
Dal77, Ko86] and the time-limited universal semimeasure [LV91, LV97, Sch02]. In
the following, we utilize and adapt the latter and see how far we get. One way
to define a time-limited universal chronological semimeasure is as a mixture over
enumerable chronological semimeasures computable within time $\tilde t$ and of size at
most $\tilde l$.

$$\xi^{\tilde t\tilde l}(\underline{yx}_{1:n}) := \sum_{\rho\,:\,\ell(\rho)\leq\tilde l\,\wedge\,t(\rho)\leq\tilde t} 2^{-\ell(\rho)}\rho(\underline{yx}_{1:n}) \qquad (42)$$

One can show that $\xi^{\tilde t\tilde l}$ reduces to the unbounded $\xi$ defined earlier for $\tilde t,\tilde l\to\infty$. Let us assume that
the true environmental prior probability $\mu^{AI}$ is equal to or sufficiently accurately
approximated by a $\rho$ with $\ell(\rho)\leq\tilde l$ and $t(\rho)\leq\tilde t$ with $\tilde t$ and $\tilde l$ of reasonable size.
There are several AI problems that fall into this class. In function minimization of
Section 5.3, the computation of $f$ and $\mu^{FM}$ is often feasible. In many cases, the
sequences of Section 5.1 that should be predicted can be easily calculated when $\mu^{SP}$
is known. In a classification problem, the probability distribution $\mu^{CF}$, according
to which examples are presented, is, in many cases, also elementary. But not all
AI problems are of this 'easy' type. For the strategic games of Section 5.2, the
environment itself is usually a highly complex strategic player with a $\mu^{SG}$ that is
difficult to calculate, although one might argue that the environmental player may
have limited capabilities too. But it is easy to think of a difficult-to-calculate physical
(probabilistic) environment like the chemistry of biomolecules.
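To make the mixture in (42) concrete, the following is a minimal Python sketch. It assumes the length and time filters have already been applied, so the candidate list simply stands in for the set of admissible $\rho$; the two toy candidates (`fair_coin`, `mostly_ones`), their code lengths and the helper names are hypothetical illustrations, not part of the original construction. Actions are omitted, so this is the pure sequence prediction case.

```python
from fractions import Fraction

def fair_coin(bits):
    """Toy candidate rho: every bit is 0 or 1 with probability 1/2."""
    return Fraction(1, 2 ** len(bits))

def mostly_ones(bits):
    """Toy candidate rho: each bit is 1 with probability 9/10."""
    p = Fraction(1)
    for b in bits:
        p *= Fraction(9, 10) if b == 1 else Fraction(1, 10)
    return p

# (assumed code length in bits, candidate); both are taken to satisfy
# the length bound and the time bound, which are not modelled here.
CANDIDATES = [(1, fair_coin), (2, mostly_ones)]

def mixture(candidates, bits):
    """xi(bits) = sum over rho of 2^(-length(rho)) * rho(bits), as in (42)."""
    return sum(Fraction(1, 2 ** length) * rho(bits) for length, rho in candidates)

def predictive(candidates, bits, next_bit):
    """Conditional mixture probability of next_bit given the observed bits."""
    return mixture(candidates, bits + [next_bit]) / mixture(candidates, bits)
```

By construction the mixture dominates each candidate up to its weight, i.e. `mixture(CANDIDATES, h) >= Fraction(1, 2**length) * rho(h)` for every history `h`, which is the multiplicative dominance property used in the text.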

The number of interesting applications makes this restricted class of AI problems,
with time- and space-bounded environment $\mu^{\tilde t\tilde l}$, worthy of study. Superscripts
to a probability distribution except for $\xi^{\tilde t\tilde l}$ indicate their length and maximal
computation time. $\xi^{\tilde t\tilde l}$ defined in (42), with a yet to be determined computation time,
multiplicatively dominates all $\mu^{\tilde t\tilde l}$ of this type. Hence, an AI$\xi^{\tilde t\tilde l}$ model, where we use
$\xi^{\tilde t\tilde l}$ as prior probability, is universal, relative to all AI$\mu^{\tilde t\tilde l}$ models in the same way
as AI$\xi$ is universal to AI$\mu$ for all enumerable chronological semimeasures $\mu$. The
$\arg\max_{y_k}$ in (22) selects a $y_k$ for which $\xi^{\tilde t\tilde l}$ has the highest expected utility $V_{km_k}$,
where $\xi^{\tilde t\tilde l}$ is the weighted average over the $\rho^{\tilde t\tilde l}$; i.e. output $\dot y_k^{AI\xi^{\tilde t\tilde l}}$ is determined by a
weighted majority. We expect AI$\xi^{\tilde t\tilde l}$ to outperform all (bounded) AI$\rho^{\tilde t\tilde l}$, analogous
to the unrestricted case.

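In the following we analyze the computability properties of $\xi^{\tilde t\tilde l}$ and AI$\xi^{\tilde t\tilde l}$, i.e. of
$\dot y_k^{AI\xi^{\tilde t\tilde l}}$. To compute $\xi^{\tilde t\tilde l}$ according to the definition (42) we have to enumerate all
chronological enumerable semimeasures $\rho^{\tilde t\tilde l}$ of length $\leq\tilde l$ and computation time $\leq\tilde t$.
This can be done similarly to the unbounded case as described in [LV97, Hut00,
Hut04]. All $2^{\tilde l}$ enumerable functions of length $\leq\tilde l$, computable within time $\tilde t$ have to
be converted to chronological probability distributions. For this, one has to evaluate
each function for $|\mathcal X|\cdot k$ different arguments. Hence, $\xi^{\tilde t\tilde l}$ is computable within time
$t(\xi^{\tilde t\tilde l}(yx_{1:k})) = O(|\mathcal X|\cdot k\cdot 2^{\tilde l}\cdot\tilde t)$. The computation time of $\dot y_k^{AI\xi^{\tilde t\tilde l}}$ depends on the size of
$\mathcal X$, $\mathcal Y$ and $m_k$. $\xi^{\tilde t\tilde l}$ has to be evaluated $|\mathcal Y|^{h_k}|\mathcal X|^{h_k}$ times in (22). It is possible to
optimize the algorithm and perform the computation within time

$$t(\dot y_k^{AI\xi^{\tilde t\tilde l}}) = O(|\mathcal Y|^{h_k}|\mathcal X|^{h_k}\cdot 2^{\tilde l}\cdot\tilde t) \qquad (43)$$

per cycle. If we assume that the computation time of $\mu^{\tilde t\tilde l}$ is exactly $\tilde t$ for all
arguments, the brute-force time $\bar t$ for calculating the sums and maxs in (11) is
$\bar t(\dot y_k^{AI\mu^{\tilde t\tilde l}}) \geq |\mathcal Y|^{h_k}|\mathcal X|^{h_k}\cdot\tilde t$. Combining this with (43), we get

$$t(\dot y_k^{AI\xi^{\tilde t\tilde l}}) = O(2^{\tilde l}\cdot\bar t(\dot y_k^{AI\mu^{\tilde t\tilde l}}))$$

This result has the proposed structure, that there is a universal AI$\xi^{\tilde t\tilde l}$ agent with
computation time $2^{\tilde l}$ times the computation time of a special AI$\mu^{\tilde t\tilde l}$ agent.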
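The scaling in (43) and in the brute-force bound can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are hypothetical, and all constants hidden in the $O(\cdot)$ notation are set to 1 by assumption.

```python
def mu_brute_force_time(n_actions, n_percepts, horizon, t):
    """Lower bound |Y|^h * |X|^h * t for the brute-force AI-mu evaluation."""
    return (n_actions * n_percepts) ** horizon * t

def xi_weighted_majority_time(n_actions, n_percepts, horizon, t, l):
    """Leading term of (43): |Y|^h * |X|^h * 2^l * t (constants dropped)."""
    return 2 ** l * mu_brute_force_time(n_actions, n_percepts, horizon, t)

# Binary sequence prediction (h = 1, |Y| = |X| = 2) with assumed t = 10, l = 3.
special = mu_brute_force_time(2, 2, 1, 10)
universal = xi_weighted_majority_time(2, 2, 1, 10, 3)
```

In this toy setting the universal agent is slower than the special one by exactly the factor $2^{\tilde l}$, while both share the exponential dependence on the horizon.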

Unfortunately, the class of AI$\mu^{\tilde t\tilde l}$ systems with brute-force evaluation of $\dot y_k$ according
to (11) is completely uninteresting from a practical point of view. For instance,
in the context of chess, the above result says that the AI$\xi^{\tilde t\tilde l}$ is superior within time
$2^{\tilde l}\cdot\tilde t$ to any brute-force minimax strategy of computation time $\tilde t$. Even if the factor of
$2^{\tilde l}$ in computation time did not matter, the AI$\xi^{\tilde t\tilde l}$ agent would nevertheless be practically
useless, as a brute-force minimax chess player with reasonable time $\tilde t$ is a very poor
player.

Note that in the case of binary sequence prediction ($h_k=1$, $|\mathcal Y|=|\mathcal X|=2$) the
computation time of $\rho$ coincides with that of $\dot y_k^{AI\rho}$ within a factor of 2. The class
AI$\rho^{\tilde t\tilde l}$ includes all non-incremental sequence prediction algorithms of length $\leq\tilde l$ and
computation time $\leq\tilde t/2$. By non-incremental, we mean that no information of
previous cycles is taken into account for speeding up the computation of $\dot y_k$ of the
current cycle.



We assume that a (Turing) machine can be simulated by another in linear time.

The shortcomings (mentioned and unmentioned ones) of this approach are cured
in the next subsection by deviating from the standard way of defining a
time-bounded $\xi$ as a sum over functions or programs.

6.2 The Idea of the Best Vote Algorithm 

A general agent is a chronological program $p(x_{<k})=y_{1:k}$. This form, introduced in
Section 2.4, is general enough to include any AI system (and also less intelligent
systems). In the following, we are interested in programs $p$ of length $\leq\tilde l$ and
computation time $\leq\tilde t$ per cycle. One important point in the time-limited setting is that
$p$ should be incremental, i.e. when computing $y_k$ in cycle $k$, the information of the
previous cycles stored on the work tape can be reused. Indeed, there is probably no
practically interesting, non-incremental AI system at all.

In the following, we construct a policy $p^*$, or more precisely, policies $p^*_k$ for every
cycle $k$ that outperform all time- and length-limited AI systems $p$. In cycle $k$, $p^*_k$
runs all $2^{\tilde l}$ programs $p$ and selects the one with the best output $y_k$. This is a 'best
vote' type of algorithm, as compared to the 'weighted majority' type algorithm of
the last subsection. The ideal measure for the quality of the output would be the
$\xi$-expected future reward

$$V^{p\xi}_{km}(\dot y\dot x_{<k}) := \sum_{q\in\dot Q_k} 2^{-\ell(q)}V^{pq}_{km}, \qquad V^{pq}_{km} := r(x_k^{pq})+...+r(x_m^{pq}) \qquad (44)$$

The program $p$ that maximizes $V^{p\xi}_{km_k}$ should be selected. We have dropped the
normalization $\mathcal N$, unlike in the earlier definition, as it is independent of $p$ and does
not change the order relation in which we are solely interested here. Furthermore,
without normalization,
$V^{*\xi}_{km}(\dot y\dot x_{<k}) := \max_{p\in P}V^{p\xi}_{km}(\dot y\dot x_{<k})$ is enumerable, which will be important later.
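The structure of (44) can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not in the text: the history is empty ($k=1$), so every environment is trivially consistent with it; the environments $q$ are two hand-written deterministic functions with made-up code lengths; and policies are plain Python functions. All names are hypothetical.

```python
from fractions import Fraction

def rollout_value(policy, env, horizon):
    """V^{pq}_{1m}: summed reward when policy p meets deterministic environment q."""
    history, total = [], 0
    for _ in range(horizon):
        action = policy(history)
        percept, reward = env(history, action)
        history.append((action, percept))
        total += reward
    return total

def xi_value(policy, envs, horizon):
    """Unnormalized V^{p xi}_{1m} = sum over q of 2^(-length(q)) * V^{pq}_{1m}."""
    return sum(Fraction(1, 2 ** length) * rollout_value(policy, env, horizon)
               for length, env in envs)

def env_rewards_zero(history, action):
    """Toy environment: reward 1 for action 0, else 0; percept is always 0."""
    return 0, (1 if action == 0 else 0)

def env_rewards_one(history, action):
    """Toy environment: reward 1 for action 1, else 0; percept is always 0."""
    return 0, (1 if action == 1 else 0)

ENVS = [(1, env_rewards_zero), (3, env_rewards_one)]  # (assumed length, q)

def always_zero(history):
    return 0

def always_one(history):
    return 1
```

With horizon 4, `xi_value(always_zero, ENVS, 4)` evaluates to 2 and `xi_value(always_one, ENVS, 4)` to 1/2, so the policy favoured by the shorter (more heavily weighted) environment would be selected.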

6.3 Extended Chronological Programs 

In the functional form of the AI$\xi$ model it was convenient to maximize $V_{km_k}$ over all
$p\in\dot P_k$, i.e. all $p$ consistent with the current history $\dot y\dot x_{<k}$. This was not a restriction,
because for every possibly inconsistent program $p$ there exists a program $p'\in\dot P_k$
consistent with the current history and identical to $p$ for all future cycles $\geq k$. For
the time-limited best vote algorithm $p^*$ it would be too restrictive to demand $p\in\dot P_k$.
To prove universality, one has to compare all $2^{\tilde l}$ algorithms in every cycle, not just
the consistent ones. An inconsistent algorithm may become the best one in later
cycles. For inconsistent programs we have to include the $\dot y_k$ into the input, i.e.
$p(\dot y\dot x_{<k})=y^p_{1:k}$ with $\dot y_i\neq y^p_i$ possible. For $p\in\dot P_k$ this was not necessary, as $p$ knows the
output $\dot y_k\equiv y^p_k$ in this case. The $r^{pq}_i$ in the definition of $V_{km}$ are the rewards emerging
in the I/O sequence, starting with $\dot y\dot x_{<k}$ (emerging from $p^*$) and then continued by
applying $p$ and $q$ with $\dot y_i:=y^p_i$ for $i\geq k$.

Another problem is that we need $V_{km_k}$ to select the best policy, but unfortunately
$V_{km_k}$ is uncomputable. Indeed, the structure of the definition of $V_{km_k}$ is very similar
to that of $\dot y_k$, hence a brute-force approach to approximate $V_{km_k}$ requires too much
computation time, as for $\dot y_k$. We solve this problem in a similar way, by supplementing
each $p$ with a program that estimates $V_{km_k}$ by $w^p_k$ within time $\tilde t$. We combine the
calculation of $y^p_k$ and $w^p_k$ and extend the notion of a chronological program once
again to

$$p(\dot y\dot x_{<k}) = w^p_1y^p_1...w^p_ky^p_k \qquad (45)$$

with chronological order $w^p_1y^p_1\dot y_1\dot x_1w^p_2y^p_2\dot y_2\dot x_2\cdots$

6.4 Valid Approximations 

Policy $p$ might suggest any output $y^p_k$ but it is not allowed to rate it with an
arbitrarily high $w^p_k$ if we want $w^p_k$ to be a reliable criterion for selecting the best $p$. We
demand that no policy is allowed to claim that it is better than it actually is. We
define a (logical) predicate VA($p$) called valid approximation, which is true if and
only if $p$ always satisfies $w^p_k\leq V^{p\xi}_{km_k}$, i.e. never overrates itself.

$$\mathrm{VA}(p) \equiv \left[\forall k\,\forall w^p_1y^p_1\dot y_1\dot x_1...w^p_ky^p_k : p(\dot y\dot x_{<k})=w^p_1y^p_1...w^p_ky^p_k \Rightarrow w^p_k\leq V^{p\xi}_{km_k}(\dot y\dot x_{<k})\right] \qquad (46)$$

In the following, we restrict our attention to programs $p$, for which VA($p$) can be
proven in some formal axiomatic system. A very important point is that $V^{*\xi}_{km_k}$ is
enumerable. This ensures the existence of sequences of programs $p_1,p_2,p_3,...$ for
which VA($p_i$) can be proven and $\lim_{i\to\infty}w^{p_i}_k = V^{*\xi}_{km_k}$ for all $k$ and all I/O sequences.
$p_i$ may be defined as the naive (nonhalting) approximation scheme (by enumeration)
of $V^{*\xi}_{km_k}$ terminated after $i$ time steps and using the approximation obtained so far
for $w^{p_i}_k$ together with the corresponding output $y^{p_i}_k$. The convergence $w^{p_i}_k\to V^{*\xi}_{km_k}$
ensures that $V^{*\xi}_{km_k}$, which we claimed to be the universally optimal value, can be
approximated by $p$ with provable VA($p$) arbitrarily well, when given enough time.
The approximation is not uniform in $k$, but this does not matter as the selected $p$
is allowed to change from cycle to cycle.
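Why a truncated enumeration yields a valid approximation can be seen in a toy setting. The sketch below assumes the value is a sum of non-negative contributions that are enumerated one at a time (for instance terms of the form $2^{-\ell(q)}V^{pq}$); stopping after a fixed number of steps then gives a claimed value that increases with the step budget and never exceeds the true total. The function names and the example terms are hypothetical.

```python
from fractions import Fraction
from itertools import islice

def lower_bounds(terms):
    """Yield running partial sums of non-negative terms (each a lower bound)."""
    total = Fraction(0)
    for term in terms:
        if term < 0:
            raise ValueError("enumeration from below needs non-negative terms")
        total += term
        yield total

def claimed_value(terms, steps):
    """Claim w of the program that stops the enumeration after `steps` terms."""
    w = Fraction(0)  # claiming 0 is always valid
    for bound in islice(lower_bounds(terms), steps):
        w = bound
    return w

TERMS = [Fraction(2), Fraction(1, 2), Fraction(1, 8)]  # assumed contributions
TRUE_VALUE = sum(TERMS)
```

Every `claimed_value(TERMS, i)` satisfies the inequality demanded by VA, and the claims reach `TRUE_VALUE` once the budget covers all terms, mirroring the convergence $w^{p_i}_k\to V^{*\xi}_{km_k}$.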

Another possibility would be to consider only those $p$ that check $w^p_k\leq V^{p\xi}_{km_k}$ online
in every cycle, instead of the pre-check VA($p$), either by constructing a proof (on the
work tape) for this special case, or because $w^p_k\leq V^{p\xi}_{km_k}$ is already evident by the construction
of $w^p_k$. In cases where $p$ cannot guarantee $w^p_k\leq V^{p\xi}_{km_k}$ it sets $w_k=0$ and, hence, trivially
satisfies $w^p_k\leq V^{p\xi}_{km_k}$. On the other hand, for these $p$ it is also no problem to prove
VA($p$) as one has simply to analyze the internal structure of $p$ and recognize that $p$
shows the validity internally itself, cycle by cycle, which is easy by assumption on
$p$. The cycle-by-cycle check is therefore a special case of the pre-proof of VA($p$).

6.5 Effective Intelligence Order Relation 

In Section 4.1 we introduced an intelligence order relation $\succeq$ on AI systems, based
on the expected reward $V^{p\xi}_{km_k}$. In the following we need an order relation $\succeq^c$ based
on the claimed reward $w^p_k$, which might be interpreted as an approximation to $\succeq$.

Definition 13 (Effective intelligence order relation) We call $p$ effectively
more or equally intelligent than $p'$ if

$$p\succeq^c p' \;:\Leftrightarrow\; \forall k\,\forall\dot y\dot x_{<k}\,\exists w_{1:k}w'_{1:k} : p(\dot y\dot x_{<k})=w_1{*}...w_k{*} \;\wedge\; p'(\dot y\dot x_{<k})=w'_1{*}...w'_k{*} \;\wedge\; w_k\geq w'_k,$$

i.e. if $p$ always claims a reward estimate $w$ at least as high as that of $p'$.

Relation $\succeq^c$ is a co-enumerable partial order relation on extended chronological
programs. Restricted to valid approximations it orders the policies w.r.t. the quality
of their outputs and their ability to justify their outputs with high $w_k$.
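The partial character of the relation in Definition 13 can be illustrated on a finite set of histories. The sketch below is only an illustration under strong assumptions: each program is represented by a hypothetical table mapping a history label to its claimed reward for that history, whereas the definition quantifies over all cycles and all histories.

```python
def effectively_at_least(claims_p, claims_q):
    """True if p claims at least as much as q on every listed history."""
    if claims_p.keys() != claims_q.keys():
        raise ValueError("both programs must be evaluated on the same histories")
    return all(claims_p[h] >= claims_q[h] for h in claims_p)

# Hypothetical claimed rewards on two test histories.
A = {"h1": 3, "h2": 2}
B = {"h1": 1, "h2": 2}
C = {"h1": 4, "h2": 0}
```

Here `A` is effectively at least as intelligent as `B`, while `A` and `C` are incomparable because each claims more than the other on one history; this is why the relation is only a partial order.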

6.6 The Universal Time-Bounded AIXItl Agent

In the following, we describe the algorithm $p^*$ underlying the universal time-bounded
AIXItl agent. It is essentially based on the selection of the best algorithms $p^*_k$ out
of the time $\tilde t$ and length $\tilde l$ bounded $p$, for which there exists a proof of VA($p$) with
length $\leq l_P$.

1. Create all binary strings of length $l_P$ and interpret each as a coding of a
mathematical proof in the same formal logic system in which VA($\cdot$) was
formulated. Take those strings that are proofs of VA($p$) for some $p$ and keep the
corresponding programs $p$.

2. Eliminate all $p$ of length $>\tilde l$.

3. Modify the behavior of all retained $p$ in each cycle $k$ as follows: Nothing is
changed if $p$ outputs some $w^p_ky^p_k$ within $\tilde t$ time steps. Otherwise stop $p$ and
write $w_k=0$ and some arbitrary $y_k$ to the output tape of $p$. Let $P$ be the set
of all those modified programs.

4. Start first cycle: $k:=1$.

5. Run every $p\in P$ on extended input $\dot y\dot x_{<k}$, where all outputs are redirected
to some auxiliary tape: $p(\dot y\dot x_{<k})=w^p_1y^p_1...w^p_ky^p_k$. This step is performed
incrementally by adding $\dot y\dot x_{k-1}$ for $k>1$ to the input tape and continuing the
computation of the previous cycle.

6. Select the program $p$ with highest claimed reward $w^p_k$: $p^*_k:=\arg\max_p w^p_k$.

7. Write $\dot y_k:=y^{p^*_k}_k$ to the output tape.

8. Receive input $\dot x_k$ from the environment.

9. Begin next cycle: $k:=k+1$, goto step 5.
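Steps 3 to 9 above can be sketched in Python as follows. This is a toy model under stated assumptions, not the construction itself: the proof search of steps 1 and 2 is skipped (the given programs are simply assumed to have a proven VA), a program is a hypothetical function `program(history, budget)` that returns a pair (claimed reward, action) or `None` when it would exceed the time budget, and the incremental reuse of earlier computation is not modelled.

```python
def time_limited(program, budget, default_action=0):
    """Step 3: keep the output if produced within budget, else claim w = 0."""
    def wrapped(history):
        result = program(history, budget)
        return result if result is not None else (0, default_action)
    return wrapped

def best_vote_agent(programs, environment, cycles, budget):
    """Steps 4-9: in each cycle follow the program with the highest claim."""
    modified = [time_limited(p, budget) for p in programs]
    history = []
    for _ in range(cycles):
        proposals = [p(history) for p in modified]            # step 5
        _, action = max(proposals, key=lambda wy: wy[0])      # steps 6-7
        percept = environment(history, action)                # step 8
        history.append((action, percept))                     # step 9
    return history

def cautious(history, budget):
    """Toy program: always answers immediately, claims reward 1 for action 0."""
    return (1, 0)

def slow_expert(history, budget):
    """Toy program: needs 5 time steps, then claims reward 3 for action 1."""
    return (3, 1) if budget >= 5 else None

def echo_environment(history, action):
    """Toy environment: the percept repeats the action."""
    return action
```

With a budget of 10 the agent follows `slow_expert` in every cycle; with a budget of 2 that program is cut off, its claim is replaced by 0, and `cautious` is followed instead.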

It is easy to see that the following theorem holds. 

Theorem 14 (Optimality of AIXItl) Let $p$ be any extended chronological
(incremental) program like (45) of length $\ell(p)\leq\tilde l$ and computation time per cycle $t(p)\leq\tilde t$,
for which there exists a proof of VA($p$) defined in (46) of length $\leq l_P$. The algorithm
$p^*$ constructed in the last paragraph, which depends on $\tilde l$, $\tilde t$ and $l_P$ but not on $p$, is
effectively more or equally intelligent, according to $\succeq^c$ (see Definition 13) than any
such $p$. The size of $p^*$ is $\ell(p^*)=O(\log(\tilde l\cdot\tilde t\cdot l_P))$, the setup-time is $t_{setup}(p^*)=O(l_P\cdot 2^{l_P})$
and the computation time per cycle is $t_{cycle}(p^*)=O(2^{\tilde l}\cdot\tilde t)$.

Roughly speaking, the theorem says that if there exists a computable solution to 
some or all AI problems at all, the explicitly constructed algorithm p* is such a 
solution. Although this theorem is quite general, there are some limitations and 
open questions that we discuss in the next subsection. 

The construction of the algorithm $p^*$ needs the specification of a formal logic
system $(\forall,\lambda,y_i,c_i,f_i,R_i,\rightarrow,\wedge,=,...)$, and axioms, and inference rules. A proof is a
sequence of formulas, where each formula is either an axiom or inferred from previous
formulas in the sequence by applying the inference rules. Details can be found in
[Hut02a] in a related construction or in any textbook on logic or proof theory, e.g.
[Fit96, Sho67]. We only need to know that provability and Turing Machines can be
formalized. The setup time in the theorem is just the time needed to verify the $2^{l_P}$
proofs, each needing time $O(l_P)$.
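The relative size of the setup time and the per-cycle time in Theorem 14 can be made tangible with a short calculation. The sketch below drops all constants hidden in the $O(\cdot)$ notation and uses hypothetical function names; it computes after how many cycles the accumulated per-cycle cost first reaches the one-off setup cost.

```python
def setup_time(l_p):
    """Leading term l_P * 2^l_P: verify 2^l_P candidate proofs, l_P steps each."""
    return l_p * 2 ** l_p

def cycle_time(l, t):
    """Leading term 2^l * t: run all programs of length at most l for t steps."""
    return 2 ** l * t

def break_even_cycles(l, t, l_p):
    """Smallest k with k * cycle_time(l, t) >= setup_time(l_p)."""
    return -(-setup_time(l_p) // cycle_time(l, t))  # ceiling division
```

For small assumed values such as `l = 2`, `t = 3`, `l_p = 3` the break-even point is reached after 2 cycles, but since the proof length typically far exceeds the program length, the break-even cycle count grows exponentially in the difference of the two lengths.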



6.7 Limitations and Open Questions 

• Formally, the total computation time of $p^*$ for cycles $1...k$ increases linearly
with $k$, i.e. is of order $O(k)$ with a coefficient $2^{\tilde l}\cdot\tilde t$. The unreasonably large
factor $2^{\tilde l}$ is a well-known drawback in best/democratic vote models and will be
taken without further comments, whereas the factor $\tilde t$ can be assumed to be
of reasonable size. If we do not take the limit $k\to\infty$ but consider reasonable
$k$, the practical significance of the time bound on $p^*$ is somewhat limited due
to the additional additive constant $O(l_P\cdot 2^{l_P})$. It is much larger than $k\cdot 2^{\tilde l}\cdot\tilde t$ as
typically $l_P\gg\ell(\mathrm{VA}(p))\geq\ell(p)\equiv\tilde l$.

• $p^*$ is superior only to those $p$ that justify their outputs (by large $w^p_k$). It
might be possible that there are $p$ that produce good outputs $y^p_k$ within
reasonable time, but it takes an unreasonably long time to justify their outputs
by sufficiently high $w^p_k$. We do not think that (from a certain complexity level
onwards) there are policies where the process of constructing a good output
is completely separated from some sort of justification process. But this
justification might not be translatable (at least within reasonable time) into a
reasonable estimate of $V^{p\xi}_{km_k}$.

• The (inconsistent) programs $p$ must be able to continue strategies started by
other policies. It might happen that a policy $p$ steers the environment to a
direction for which $p$ is specialized. A "foreign" policy might be able to displace
$p$ only between loosely connected episodes. There is probably no problem for
factorizable $\mu$. Think of a chess game, where it is usually very difficult to
continue the game or strategy of a different player. When the game is over, it
is usually advantageous to replace a player by a better one for the next game.
There might also be no problem for sufficiently separable $\mu$.

• There might be (efficient) valid approximations $p$ for which VA($p$) is true but
not provable, or for which only a very long ($>l_P$) proof exists.

6.8 Remarks 

• The idea of suggesting outputs and justifying them by proving reward bounds 
implements one aspect of human thinking. There are several possible reactions 
to an input. Each reaction possibly has far-reaching consequences. Within a 
limited time one tries to estimate the consequences as well as possible. Finally, 
each reaction is valuated, and the best one is selected. What is inferior to 
human thinking is that the estimates must be rigorously proved and the 
proofs are constructed by blind exhaustive search, further, that all behaviors 
$p$ of length $\leq\tilde l$ are checked. It is inferior "only" in the sense of necessary
computation time but not in the sense of the quality of the outputs. 

• In practical applications there are often cases with short and slow programs
$p_s$ performing some task $T$, e.g. the computation of the digits of $\pi$, for which
there exist long but quick programs $p_l$ too. If it is not too difficult to prove that
this long program is equivalent to the short one, then it is possible to prove
$K^{t(p_l)}(T)\leq\ell(p_s)$ with $K^t$ being the time-bounded Kolmogorov complexity.
Similarly, the method of proving bounds $w_k$ for $V_{km_k}$ can give high lower
bounds without explicitly executing these short and slow programs, which
mainly contribute to $V_{km_k}$.

• Dovetailing all length- and time-limited programs is a well-known elementary 
idea (e.g. typing monkeys). The crucial part that was developed here, is the 
selection criterion for the most intelligent agent. 

• The construction of AIXItl and the enumerability of $V_{km_k}$ ensure arbitrarily
close approximations of $V_{km_k}$, hence we expect that the behavior of AIXItl
converges to the behavior of AI$\xi$ in the limit $\tilde t,\tilde l,l_P\to\infty$, in some sense.

• Depending on what you know or assume that a program $p$ of size $\tilde l$ and
computation time per cycle $\tilde t$ is able to achieve, the computable AIXItl model
will have the same capabilities. For the strongest assumption of the existence
of a Turing machine that outperforms human intelligence, AIXItl will do too,
within the same time frame up to an (unfortunately very large) constant factor.



7 Discussion 

This section reviews what has been achieved in the article and discusses some
otherwise unmentioned topics of general interest. We remark on various topics, including
concurrent actions and perceptions, the choice of the I/O spaces, treatment of
encrypted information, and peculiarities of mortal embodied agents. We continue with
an outlook on further research. Since many ideas have already been presented in the
various sections, we concentrate on nontechnical open questions of general
importance, including optimality, down-scaling, implementation, approximation, elegance,
extra knowledge, and training of/for AIXI(tl). We also include some (personal)
remarks on non-computable physics, the number of wisdom $\Omega$, and consciousness. As
it should be, the article concludes with conclusions.

7.1 General Remarks 

Game theory. In game theory [OR94] one often wants to model the situation
of simultaneous actions, whereas the AI$\xi$ models have serial I/O. Simultaneity can
be simulated by withholding the agent's current output $y_k$ from the environment
until $x_k$ has been received by the agent. Formally, this means that $\mu(yx_{<k}y\underline{x}_k)$ is
independent of the last output $y_k$. The AI$\xi$ agent is already of simultaneous type in
an abstract view if the behavior $p$ is interpreted as the action. In this sense, AIXI
is the action $p^*$ that maximizes the utility function (reward), under the assumption
that the environment acts according to $\xi$. The situation is different from game
theory, as the environment $\xi$ is not a second 'player' that tries to optimize his own
utility (see Section 5.2).

Input/output spaces. In various examples we have chosen differently specialized 
input and output spaces $\mathcal X$ and $\mathcal Y$. It should be clear that, in principle, this is
unnecessary, as large enough spaces X and y (e.g. the set of strings of length 2^^) 
serve every need and can always be Turing-reduced to the specific presentation 
needed internally by the AIXI agent itself. But it is clear that, using a generic 
interface, such as camera and monitor for learning tic-tac-toe, for example, adds the 
task of learning vision and drawing. 

How AIXI(tl) deals with encrypted information. Consider the task of
decrypting a message that was encrypted by a public key encrypter like RSA. A
message $m$ is encrypted using a product $n$ of two large primes $p_1$ and $p_2$, resulting in
encrypted message $c$=RSA$(m|n)$. RSA is a simple algorithm of size $O(1)$. If AIXI is
given the public key $n$ and encrypted message $c$, in order to reconstruct the original
message $m$ it only has to "learn" the function RSA$^{-1}(c|n)$ := RSA$(c|p_1,p_2)=m$.
RSA$^{-1}$ can itself be described in length $O(1)$, since RSA is $O(1)$ and $p_1$ and $p_2$ can
be reconstructed from $n$. Only very little information is needed to learn $O(1)$ bits.
In this sense decryption is easy for AIXI (like TSP, see Section 5.3). The problem
is that while RSA is efficient, RSA$^{-1}$ is an extremely slow algorithm, since it has
to find the prime factors from the public key. But note, in AIXI we are not talking
about computation time, we are only talking about information efficiency (learning
in the least number of interaction cycles). One of the key insights in this article
that allowed for an elegant theory of AI was this separation of data efficiency from
computation time efficiency. Of course, in the real world computation time matters,
so we invented AIXItl. AIXItl can do every job as well as the best length $\tilde l$ and
time $\tilde t$ bounded agent, apart from time factor $2^{\tilde l}$ and a huge offset time. No practical
offset time is sufficient to find the factors of $n$, but in theory, enough offset time
also allows AIXItl to (once-and-for-all) find the factorization, and then decryption
is easy of course.
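The contrast between description length and computation time can be seen in a toy RSA round trip. The sketch below is an illustration under assumptions not made in the text: it uses textbook RSA with an explicit public exponent `e` (the text only mentions the modulus $n$), tiny primes, and trial division for factoring; the function names are hypothetical. The decryption routine is only a few lines long, yet its factoring loop takes time exponential in the bit length of $n$.

```python
def factor(n):
    """Trial division: a short program, but exponentially slow in the bits of n."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    raise ValueError("n has no nontrivial factor")

def rsa_encrypt(m, n, e):
    """Textbook RSA encryption with public key (n, e)."""
    return pow(m, e, n)

def rsa_decrypt_from_public(c, n, e):
    """Invert RSA from public data only, by first recovering p1 and p2 from n."""
    p1, p2 = factor(n)
    d = pow(e, -1, (p1 - 1) * (p2 - 1))  # modular inverse, needs Python 3.8+
    return pow(c, d, n)

# Toy parameters (assumed): n = 53 * 61, public exponent e = 17.
N, E = 53 * 61, 17
CIPHERTEXT = rsa_encrypt(65, N, E)
```

Calling `rsa_decrypt_from_public(CIPHERTEXT, N, E)` recovers the message 65. For realistic key sizes the same short program would still be correct but could not finish in any practical amount of time, which is the separation of information efficiency from computation time described above.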

Mortal embodied agents. The examples we gave in this article, particularly those
in Section 5, were mainly bodiless agents: predictors, gamblers, optimizers, learners.
There are some peculiarities with reinforcement learning autonomous embodied
robots in real environments.

We can still reward the robot according to how well it solves the task we want it 
to do. A minimal requirement is that the robot's hardware functions properly. If the 
robot starts to malfunction its capabilities degrade, resulting in lower reward. So, 
in an attempt to maximize reward, the robot will also maintain itself. The problem 
is that some parts will malfunction rather quickly when no appropriate actions are 
performed, e.g. flat batteries, if not recharged in time. Even worse, the robot may 
work perfectly until the battery is nearly empty, and then suddenly stop its operation 
(death), resulting in zero reward from then on. There is too little time to learn how 
to maintain itself before it's too late. An autonomous embodied robot cannot start 
from scratch but must have some rudimentary built-in capabilities (which may not 
be that rudimentary at all) that allow it to at least survive. Animals survive due 
to reflexes, innate behavior, an internal reward attached to the condition of their 
organs, and a guarding environment during childhood. Different species emphasize 
different aspects. Reflexes and innate behaviors are stressed in lower animals versus 
years of safe childhood for humans. The same variety of solutions is available for 
constructing autonomous robots (which we will not detail here). 

Another problem connected, but possibly not limited to embodied agents,
especially if they are rewarded by humans, is the following: Sufficiently intelligent agents
may increase their rewards by psychologically manipulating their human "teachers",
or by threatening them. This is a general sociological problem which successful AI
will cause, which has nothing specifically to do with AIXI. Every intelligence
superior to humans is capable of manipulating the latter. In the absence of manipulable
humans, e.g. where the reward structure serves a survival function, AIXI may
directly hack into its reward feedback. Since this is unlikely to increase its long-term
survival, AIXI will probably resist this kind of manipulation (just as most humans
don't take hard drugs, due to their long-term catastrophic consequences).

7.2 Outlook & Open Questions

Many ideas for further studies were already stated in the various sections of the
article. This outlook only contains nontechnical open questions regarding AIXI(tl)
of general importance.

Value bounds. Rigorous proofs for non-asymptotic value bounds for AI$\xi$ are the
major theoretical challenge - general ones, as well as tighter bounds for special
environments $\mu$, e.g. for rapidly mixing MDPs, and/or other performance criteria
have to be found and proved. Although not necessary from a practical point of
view, the study of continuous classes $\mathcal M$, restricted policy classes, and/or infinite $\mathcal Y$,
$\mathcal X$ and $m$ may lead to useful insights.

Scaling AIXI down. A direct implementation of the AIXItl model is, at best,
possible for small-scale (toy) environments due to the large factor 2^l in computation
time. But there are other applications of the AIXI theory. We saw in several
examples how to integrate problem classes into the AIXI model. Conversely, one
can downscale the AIξ model by using more restricted forms of ξ. This could be
done in the same way as the theory of universal induction was downscaled with
many insights to the Minimum Description Length principle [LV92a, Ris89] or to
the domain of finite automata [FMG92]. The AIXI model might similarly serve as a
supermodel or as the very definition of (universal unbiased) intelligence, from which
specialized models could be derived.
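
To make the factor 2^l concrete, the following minimal sketch counts the work of a brute-force enumeration. It is not the AIXItl algorithm: it only assumes, for illustration, that candidate programs are binary strings of length at most l and that each one is run for at most t steps. The names `enumerate_programs` and `enumeration_cost` are hypothetical.

```python
from itertools import product


def enumerate_programs(max_len):
    """Yield every binary string of length 1..max_len (a hypothetical program space)."""
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)


def enumeration_cost(t, max_len):
    """Upper bound on total steps if each candidate program runs for at most t steps."""
    n_programs = sum(1 for _ in enumerate_programs(max_len))
    return t * n_programs


if __name__ == "__main__":
    # Each extra bit of program length roughly doubles the cost.
    for l in (4, 8, 12, 16):
        print(l, enumeration_cost(t=10, max_len=l))
```

Under these assumptions there are 2^(l+1) - 2 candidate programs, so the cost is of order t·2^l, which is why only toy values of l are feasible.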

Implementation and approximation. With a reasonable computation time, the
AIXI model would be a solution of AI (see the next point if you disagree). The
AIXItl model was the first step, but the elimination of the factor 2^l without giving
up universality will almost certainly be a very difficult task.* One could try to select
programs p and prove VA(p) in a more clever way than by mere enumeration, to
improve performance without destroying universality. All kinds of ideas like genetic
algorithms, advanced theorem provers and many more could be incorporated. But
now we have a problem.

Computability. We seem to have transferred the AI problem just to a different 
level. This shift has some advantages (and also some disadvantages) but does not 
present a practical solution. Nevertheless, we want to stress that we have reduced the 
AI problem to (mere) computational questions. Even the most general other systems 
the author is aware of depend on some (more than complexity) assumptions about 
the environment or it is far from clear whether they are, indeed, universally optimal. 
Although computational questions are themselves highly complicated, this reduction 
is a nontrivial result. A formal theory of something, even if not computable, is often 
a great step toward solving a problem and also has merits of its own, and AI should 
not be different in this respect (see previous item). 

Elegance. Many researchers in AI believe that intelligence is something complicated 
and cannot be condensed into a few formulas. It is more a combining of enough 
methods and much explicit knowledge in the right way. From a theoretical point 
of view we disagree, as the AIXI model is simple and seems to serve all needs. 
From a practical point of view we agree to the following extent: To reduce the 
computational burden one should provide special-purpose algorithms (methods) from
the very beginning, probably many of them related to reducing the complexity of the
input and output spaces X and Y by appropriate pre/postprocessing methods.

* But see [Hut02a] for an elegant theoretical solution.

Extra knowledge. There is no need to incorporate extra knowledge from the very
beginning. It can be presented in the first few cycles in any format. As long as the
algorithm to interpret the data is of size O(1), the AIXI agent will "understand" the
data after a few cycles (see Section 5). If the environment μ is complicated but
extra knowledge z makes K(μ|z) small, one can show that the bound (17) reduces
roughly to ln 2 · K(μ|z) when x_1 = z, i.e. when z is presented in the first cycle. The
special-purpose algorithms could be presented in x_1 too, but it would be cheating to
say that no special-purpose algorithms were implemented in AIXI. The boundary
between implementation and training is unsharp in the AIXI model.
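
The effect of extra knowledge on the bound can be summarized in a few lines. This is only a sketch: it assumes that the bound in question has leading term ln 2 · K(μ), and it uses the standard fact that conditioning increases Kolmogorov complexity by at most an additive constant.

```latex
% Assumption: the bound discussed above has leading term \ln 2 \cdot K(\mu).
\begin{align*}
  K(\mu \mid z) &\le K(\mu) + O(1)
     && \text{(conditioning costs at most a constant)} \\
  \ln 2 \cdot K(\mu \mid z) &\le \ln 2 \cdot K(\mu) + O(1)
     && \text{(presenting } z \text{ cannot hurt much)} \\
  K(\mu \mid z) &= O(1)
     && \text{(extreme case: } z \text{ already encodes } \mu\text{)}
\end{align*}
```

In the extreme case the bound becomes a constant, which matches the remark that the boundary between implementation and training is not sharp.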

Training. We have not said much about the training process itself, as it is not 
specific to the AIXI model and has been discussed in the literature in various forms
and disciplines [Sol86, Sch03, Sch04]. By a training process we mean a sequence of
simple-to-complex tasks to solve, with the simpler ones helping in learning the more
complex ones. A serious discussion would be out of place. To repeat a truism, it is,
of course, important to present enough knowledge o_k and evaluate the agent output
y_k with r_k in a reasonable way. To maximize the information content in the reward,
one should start with simple tasks and give positive reward to approximately the
better half of the outputs y_k.
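
The "better half" rule can be illustrated with a short sketch. It assumes, purely for illustration, a binary reward that is positive with probability p; the Shannon entropy of such a reward is largest at p = 1/2. The helper names `binary_entropy` and `median_rewards` are hypothetical and not part of the AIXI model.

```python
import math
from statistics import median


def binary_entropy(p):
    """Shannon entropy (in bits) of a binary reward that is positive with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def median_rewards(scores):
    """Give reward 1 to outputs scoring above the median, 0 otherwise (hypothetical scheme)."""
    cut = median(scores)
    return [1 if s > cut else 0 for s in scores]


if __name__ == "__main__":
    # Entropy peaks at p = 0.5, i.e. when about half of the outputs are rewarded.
    for p in (0.01, 0.1, 0.5, 0.9):
        print(p, round(binary_entropy(p), 3))
    print(median_rewards([0.2, 0.9, 0.4, 0.7, 0.1, 0.8]))
```

A reward given almost always or almost never carries close to zero bits per cycle, which is one way to read the advice to start with simple tasks.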

7.3 The Big Questions 

This subsection is devoted to the big questions of AI in general and the AIXI model 
in particular with a personal touch. 

On non-computable physics & brains. There are two possible objections to AI
in general and, therefore, to AIXI in particular. Non-computable physics (which is 
not too weird) could make Turing computable AI impossible. As at least the world 
that is relevant for humans seems mainly to be computable we do not believe that 
it is necessary to integrate non-computable devices into an AI system. The (clever 
and nearly convincing) Gödel argument by Penrose [Pen89, Pen94], refining Lucas
[Luc61], that non-computational physics must exist and is relevant to the brain, has
(in our opinion convincing) loopholes. 

Evolution & the number of wisdom. A more serious problem is the evolutionary
information-gathering process. It has been shown that the 'number of wisdom' Ω
contains a very compact tabulation of 2^n undecidable problems in its first n binary
digits [Cha91]. Ω is only enumerable with computation time increasing more rapidly
with n than any recursive function. The enormous computational power of evolution
could have developed and coded something like Ω into our genes, which significantly
guides human reasoning. In short: Intelligence could be something complicated,
and evolution toward it from an even cleverly designed algorithm of size O(1) could
be too slow. As evolution has already taken place, we could add the information
from our genes or brain structure to any/our AI system, but this means that the
important part is still missing, and that it is impossible in principle to derive an
efficient algorithm from a simple formal definition of AI.

Consciousness. For what is probably the biggest question, that of consciousness,
we want to give a physical analogy. Quantum (field) theory is the most accurate and 
universal physical theory ever invented. Although already developed in the 1930s, 
the big question, regarding the interpretation of the wave function collapse, is still 
open. Although this is extremely interesting from a philosophical point of view, it 
is completely irrelevant from a practical point of view.† We believe the same to be
valid for consciousness in the field of Artificial Intelligence: philosophically highly 
interesting but practically unimportant. Whether consciousness will be explained 
some day is another question. 

7.4 Conclusions 

The major theme of the article was to develop a mathematical foundation of Ar- 
tificial Intelligence. This is not an easy task since intelligence has many (often 
ill-defined) faces. More specifically, our goal was to develop a theory for rational 
agents acting optimally in any environment. Thereby we touched various scientific 
areas, including reinforcement learning, algorithmic information theory, Kolmogorov 
complexity, computational complexity theory, information theory and statistics, 
Solomonoff induction, Levin search, sequential decision theory, adaptive control
theory, and many more.

We started with the observation that all tasks that require intelligence to be 
solved can naturally be formulated as a maximization of some expected utility in 
the framework of agents. We presented a functional (3) and an iterative (11) formulation
of such a decision-theoretic agent in Section 2, which is general enough to cover
all AI problem classes, as was demonstrated by several examples. The main remaining
problem is the unknown prior probability distribution μ of the environment(s).
Conventional learning algorithms are unsuitable, because they can neither handle 
large (unstructured) state spaces, nor do they converge in the theoretically minimal 
number of cycles, nor can they handle non-stationary environments appropriately. 
On the other hand, Solomonoff's universal prior ξ, rooted in algorithmic
information theory, solves the problem of the unknown prior distribution for induction
problems as was demonstrated in Section 3. No explicit learning procedure is
necessary, as ξ automatically converges to μ. We unified the theory of universal sequence
prediction with the decision-theoretic agent by replacing the unknown true prior μ by
an appropriately generalized universal semimeasure ξ in Section 4. We gave various
arguments that the resulting AIXI model is the most intelligent, parameter-free and
environmental/application-independent model possible. We defined an intelligence
order relation (Definition 10) to give a rigorous meaning to this claim. Furthermore,
possible solutions to the horizon problem have been discussed. In Section 5, we
outlined how the AIXI model solves various problem classes. These included sequence
prediction, strategic games, function minimization and, especially, learning to learn
supervised. The list could easily be extended to other problem classes like classification,
function inversion and many others. The major drawback of the AIXI model is
that it is uncomputable, or more precisely, only asymptotically computable, which
makes an implementation impossible. To overcome this problem, we constructed
a modified model AIXItl, which is still effectively more intelligent than any other
time t and length l bounded algorithm (Section 6). The computation time of AIXItl
is of the order t·2^l. A way of overcoming the large multiplicative constant 2^l was
presented in [Hut02a] at the expense of an (unfortunately even larger) additive
constant. Possible further research was discussed. The main directions could be to
prove general and special reward bounds, use AIXI as a supermodel and explore
its relation to other specialized models, and finally improve performance with or
without giving up universality.

† In the Theory of Everything, the collapse might become of 'practical' importance and must or
will be solved.

All in all, the results show that Artificial Intelligence can be framed by an el- 
egant mathematical theory. Some progress has also been made toward an elegant 
computational theory of intelligence. 



Annotated Bibliography 

Introductory textbooks. The book by Hopcroft and Ullman, in its new revision
co-authored by Motwani [HMU01], is a very readable elementary introduction
to automata theory, formal languages, and computation theory. The Artificial
Intelligence book [RN03] by Russell and Norvig gives a comprehensive overview of
AI approaches in general. For an excellent introduction to Algorithmic Information
Theory, Kolmogorov complexity, and Solomonoff induction one should consult
the book of Li and Vitányi [LV97]. The Reinforcement Learning book by Sutton
and Barto [SB98] requires no background knowledge, describes the key ideas, open
problems, and great applications of this field. A tougher and more rigorous book
by Bertsekas and Tsitsiklis on sequential decision theory provides all (convergence)
proofs [BT96].

Algorithmic information theory. Kolmogorov [Kol65] suggested defining the
information content of an object as the length of the shortest program computing a
representation of it. Solomonoff [Sol64] invented the closely related universal prior
probability distribution and used it for binary sequence prediction [Sol64, Sol78] and
function inversion and minimization [Sol86]. Together with Chaitin [Cha66, Cha75],
this was the invention of what is now called Algorithmic Information Theory. For
further literature and many applications see [LV97]. Other interesting applications
can be found in [Cha91, Sch99, VW98]. Related topics are the Weighted Majority
algorithm invented by Littlestone and Warmuth [LW94], universal forecasting
by Vovk [Vov92], Levin search [Lev73], PAC-learning introduced by Valiant [Val84]
and Minimum Description Length [LV92a, Ris89]. Resource-bounded complexity
is discussed in [Dal73, Dal77, FMG92, Ko86, PF97], resource-bounded universal
probability in [LV91, LV97, Sch02]. Implementations are rare and mainly due to
Schmidhuber [Con97, Sch97, SZW97, Sch03, Sch04]. Excellent reviews with a
philosophical touch are [LV92b, Sol97]. For an older general review of inductive inference
see Angluin [AS83].
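
Kolmogorov's definition is uncomputable, but its flavor can be conveyed with a crude stand-in. The sketch below uses the length of zlib-compressed data as a rough upper-bound proxy for the length of the shortest description. This substitution is an assumption made for illustration only, and `compressed_length` is a hypothetical helper, not something taken from the cited literature.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data: a crude upper-bound proxy for K."""
    return len(zlib.compress(data, 9))


if __name__ == "__main__":
    regular = b"ab" * 500                      # highly regular, 1000 bytes
    rng = random.Random(0)                     # fixed seed for reproducibility
    noisy = bytes(rng.getrandbits(8) for _ in range(1000))  # pseudo-random, 1000 bytes
    print(compressed_length(regular), compressed_length(noisy))
```

The regular string compresses to a small fraction of its length, while the pseudo-random one does not compress at all. Note that the pseudo-random bytes were in fact produced by a very short program, so their true complexity is small. A general-purpose compressor cannot detect this, which is why the proxy is only an upper bound.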

Sequential decision theory. The other ingredient in our AIξ model is sequential
decision theory. We do not need much more than the maximum expected utility
principle and the expectimax algorithm [Mic66, RN03]. The book of von Neumann
and Morgenstern [NM44] might be seen as the initiation of game theory, which
already contains the expectimax algorithm as a special case. The literature on
reinforcement learning and sequential decision theory is vast and we refer to the
references given in the textbooks [SB98, BT96].
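
For readers unfamiliar with expectimax, here is a minimal sketch on a toy decision tree with a known environment. The node encoding (`"leaf"`, `"max"`, `"chance"`) is a hypothetical choice made for this example. The sketch shows the maximum expected utility principle only, not the AIξ model, in which the known probabilities would be replaced by a universal mixture.

```python
def expectimax(node):
    """Return the expected utility of a toy decision tree.

    Node formats (hypothetical, for illustration only):
      ("leaf", utility)
      ("max", [child, ...])                   # agent picks the best action
      ("chance", [(prob, child), ...])        # environment responds at random
    """
    kind, payload = node
    if kind == "leaf":
        return payload
    if kind == "max":
        return max(expectimax(child) for child in payload)
    if kind == "chance":
        return sum(prob * expectimax(child) for prob, child in payload)
    raise ValueError(f"unknown node kind: {kind}")


if __name__ == "__main__":
    tree = ("max", [
        ("chance", [(0.5, ("leaf", 0.0)), (0.5, ("leaf", 1.0))]),   # risky action
        ("chance", [(1.0, ("leaf", 0.25))]),                         # safe action
    ])
    print(expectimax(tree))
```

In this toy tree the risky action has expected utility 0.5 and the safe action 0.25, so the agent prefers the risky one. Nesting further `"max"` and `"chance"` layers gives the multi-cycle case.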

The author's contributions. Details on most of the issues addressed in this article
can be found in various reports or publications or the book [Hut04] by the author:
The AIξ model was first introduced and discussed in March 2000 in [Hut00] in a
62-page report. More succinct descriptions were published in [Hut01d, Hut01e].
The AIξ model has been argued to formally solve a number of problem classes,
including sequence prediction, strategic games, function minimization, reinforcement
and supervised learning [Hut00]. A variant of AIξ has recently been shown
to be self-optimizing and Pareto optimal [Hut02b]. The construction of a general
fastest algorithm for all well-defined problems [Hut02a] arose from the construction
of the time-bounded AIXItl model [Hut01d]. Convergence [Hut03b] and tight
[Hut03c] error [Hut01c, Hut01a] and loss [Hut01b, Hut03a] bounds for Solomonoff's
universal sequence prediction scheme have been proven. Loosely related ideas on
a market/economy-based reinforcement learner [KHS01b] and gradient-based
reinforcement planner [KHS01a] were implemented. These and other papers are
available at http://www.idsia.ch/~marcus/ai